Great video, I enjoyed it! In my eyes the video actually shows how fast C++ is. Unoptimized line by line translation from Python to C++ can be as fast as compiled Python optimized with HPC library.
@BartekLeon-jx5jv it's not a 1D array, but a homogeneous ND array. It's somewhere between vector and int[A][B]. It is represented as a flat array in memory, but unlike int[A][B], the data type, number of dimensions, sizes of these dimensions and the iteration strides are dynamic. Also, it's not just taichi that's using ndarrays, numpy and numba are also using ndarrays here.
I've also had some fun using various methods to speed python up, and this video is a great overview of the major ways of going about it, but while it's a big departure, I've found nim to have the most python-like syntax while being as fast as things get (compiles to c, among many other languages). I've seen that you know about the true power of python already, but James Powell did a great talk about this exact topic titled "Objectionable Content", big recommend. Thanks for the video!
I'll check it out! Also, I have looked at Nim in the past. It seems nice. Eventually I may do another video on this topic, and branch out to other languages (Nim, Julia, and now Mojo). Thanks for the idea, the video rec, and thoughtful comment =]
pypy is a jit for full python with special bindings for numpy and scipy. you can use it for any python code, but for max performance might need to write critical parts of your code in rpython, a subset of python that can be statically compiled to native binary. The example subsequence code is valid rpython btw.
Nice video. I have just learned cython and achieved a speed up of 500x vs pure python(+numpy) in one of my code. It worth to mention that using cython, you can automatically parallyze your loop with prange statement instead of range.
500x is great! And good point on prange-- I should have covered the parallel aspect more of all the solutions (numba, Taichi, and cython) but I glossed over it due to the serial nature of the example problem. Thanks for the comment =]
After reading the other comments while thinking up my own, I feel compelled to echo this sentiment first. Fantastic job, @dougmercer - both technically and visually - I loved it all.
Super nice video my man. I've watched it a few times, I thought you would have several hundred thousand subs when I looked at your channel. Great, great content.
I was already aware of numba, but it's good to see all the others like this. Enjoyable video, and I was happy you showed most of the code, while somehow making it feel like a documentary
Dude, the quality and depth of this video is insane. I feel like I have a deeper understanding of the strengths and limitations of python, and I have been using it for about 7 years. Thank you
Hmm. Hard to say. Could try mypyc-- maybe it'll just magically work. Alternatively, though this might be a bit disruptive, you could swap out CPython with PyPy (a JIT compiled replacement for the CPython interpreter). In the video I'm working on now, PyPy was shockingly convenient and fast.
What are you plotting, out of curiosity? Maybe do a quick sanity check to make sure the amount of data your plotting has exceeded the usefulness of matplotlib. If it's a scatter plot with millions of points, maybe you should use something like datashader or similar
@@dougmercer I'm using it to do histograms for images that have been turned black and white and then converted to 8 bit png files to convert them to stippling.
@@FranLegonIf you’re comparing C/Cpp/Rust/Zig and saying they’re different because of a benchmark you saw you’re j ignorant. they all compile to the LLVM IR nowadays that has its own optimizations
Not really, that JIT compiler can generate better code than the C++ compiler, because of things like automatic vectorization. Obviously, you'd be able to write such code yourself in C++ (which can be quite painful, that's why using a python-like language is so interesting).
True, I should have spent a bit more time optimizing the C++ approach. Here is the compile command I used gist.github.com/dougmercer/1a0fab15abf45d836c2290b98e6c1cd3
@@dougmercer update: I've tried to use std::array for this case, but it only works for small number of n because I, an idiot, forgot that stack memory is severely limited to just some kilobytes🤣🤣🤣
Thanks =] I appreciate it. It's been a slow grind, but the past few days the algorithm has blessed me with some impressions, so I hope it keeps going 🤞
This is a very high quality content, mate! Well done! A question, for gamedev use case, can we just use the tools mentioned to speedup things? I've seen horrible performance when someone is using Python-based game engine (like pygame etc).
Thanks! =] Yes, you should be accelerate a pygame-based game with these tools. You can't speed up pygame functions and methods, but you can speed up your code between those calls. It'll be most well suited for larger, number crunchy parts between methods rather than quick little one-off operations. Let me know if you end up tweaking something and seeing a boost in performance!
Great video! If I could make a suggestion though, the background music is a little too loud. It was hard to follow some of what you were saying because of it (like for the cython stuff for example). Overall awesome video though and learned a lot!
I did not count compilation time for the c++ times, but did include JIT time for the first run of Numba. However, it doesn't play a big impact, because we are typically doing 100s or thousands of runs and adding up their times (so the first run being slow only accounts for a small part of the overall time)
limited branch c code will usually be faster in most applications , but if you want code to be ridiculously fast use assembly. inline assembly is cool too works directly with c. however speed comes at the cost of convenience often
I learned C++ in the last month (came from a Python background!) and tried my luck at coding real-time animations of fractals. I wanted to compare with Python's performance, but now I am scared I learned C++ for nothing... Thanks! (Just kidding I loved learning C++ and I am glad I did. It's super impressive however to see that we can achieve similar performances with these packages in Python! Thanks for the video).
Taichi is great for fractals! I like that it has good built in infrastructure for plotting to a canvas. That said, I'm sure you'll find a use for your new-found C++ knowledge =]
My favourite was numba as we were able to achieve our goal with very little code, there are certain shortcut algorithms that can be applied to makeup for its non applicable functions
Thanks for the analysis, I got motivated to look at numba and cython more carefully. Taichi looked cool, but not having it in the anaconda repo is a negative point for me. Have you tried running this code with TORCH?
Oh interesting, I didn't realize taichi wasn't on conda-forge. I wonder if they'd accept a PR 🤔. For what it's worth, you can pip install it (and that's possible even if you're using an environment.yml). I did not try torch, but I suspect it would very slow. Reason being-- the main use case for torch is parallel computing via tensors. Since this problem is inherently not parallelizable, my guess is it'd be super slow in torch.
in therms of lists, A tuple is way faster as a ordinary list or even a dictionary. you can speed things up if you load things in a tuple instead of a list if you don't have to modify it. Also lock at your bytecode you can already see if something is unnecessary. And dont use Dict unless you have to. That are some of the things to improve runtime. Hey even the order of your functions have an impact.
I'm learning Zig and decided to practice this benchmark and see how fast it could go. If we use the same optimization as the last taichi variant (pre-allocating all necessary memory), it takes 1150ms. If we leave allocation inside (creating and zeroing the matrix, cleaning the memory after we got the result) its about 1800ms (i7-14700k, Ubuntu).
True! Through C/C++ extension libraries, you can directly write/link C/C++ libraries and write your own Python interface to it. Cppyy, ctypes, cffi, pybind11, and Cython are all fair game for this.
If this video ends up getting some more views, maybe I'll do another pass at adding other options. I have a *guess* though... PyPy would speed this up significantly, probably on par with numba. I've heard good things about it *but* it didn't install first try when using conda on my M1 Mac, so I skipped it ¯\_(ツ)_/¯ Nuitka would only speed things up a little bit. From what I've read, nuitka is more so about compatibility (supports *all* python language constructs) and for making standalone, portable builds. For nuitka, speed is secondary to those concerns
That was funny, I did both C++ and Python but now I'm more on C++ side. I had in mind the meme "look what they need to mimic a fraction of our power", I didn't tested it, but I bet If you change the proper compilation options that will be faster again in C++. To my understanding this is what taichi do, it's general SIMD based on your current hardware, under the hood via LLVM optimizer based on the data structure (taichi is tailored for sparse data structure). As you work with dense data Halide would give you [maybe] better results. For all cases the code generated by python front end can be generated by C++, the python will always have an overhead. This is what Machine Learning people do, they don't care about python performances, because all the computation which too 90% of their frame is implemented on CUDA and C++, the python is here only to provide data to lower level system.
> "look what they need to mimic a fraction of our power" Haha, true! In another comment, I said I loved that even if I write terrible C++ it still turns out pretty fast. That said, the same argument could be reversed, if we consider productivity and third party library access. If an application is 95% high level glue and one hot spot, I'd rather write the majority in Python and the hot spot in an AOT or JIT compiled variant of Python than write my entire app in a low level language. The overhead would be worthwhile from a productivity perspective. > Proper compilation flags Do you have flags you want me to try in particular? I did -std=c++11 -O3, but maybe I'm missing something. > SIMD Since this is all sequential, can SIMD help? I thought SIMD was for packing multiple of the same operations in a single instruction (but again, I'm not a C++ dev) > the Python just provides an interface to a lower level language. True! And I'm OK with that! I def agree that well written, native code in a lower level will out-perform generated code from Python. That said, for all but the most trivial algorithms, I can't write well-written C++. So, if I can get even a 95% solution for free from these high level LLVM interfaces, then I'm stoked!
@@dougmercer ( : That remind me a benchmark done by Microsoft, Debug C++ /NoSIMD vs Release C# SIMD, and they notice faster C# :D Yeah sure... The point of Python is not to be faster, it's mostly to be gentle with non-engineer-long-beard programmer, the user are mostly scientist and data-analysts. > Productivity For this example I see no productivity differences between C++ and Python. But personally I'm more productive in C++ with Eigen and few other lib Like an experimented Python will be faster with numpy and his other favorite libs. > Proper compilation flags I don't know what is your compiler, but for Visual Studio: /Ot {favorize speed} /Oi {Inable Intrinsic} To increase the STL speed, Disable C++ expcetion, "Basic Runtime Checks", /GS-, /GR- ... To help intrinsic generation /Zp8 or /Zp16 (here you're processing int), but we can process And based on your hardware /arch:AVX, ... > SIMD You have gather and scatter instruction that could help, need to profile ( : > Improve On both side I'll bet we can performance by using only type you need. If your number cannot go higher than 100 just use a byte/uint8_t, etc. As I said the video was funny, the point is not to say Python is faster than C++, but more "if you're careful you can have performance higher or close to baseline C++"
I'm using g++, I'll try to find the analogs for the compiler flags you recommended. And true, a uint8 is enough. I'll mess around with that too. In any case, thanks for the comments! I'd def like to learn more about C++ but I don't get the opportunity very often
@@chkone007 Ok, I get the point but theres a lot of production code written in python, most code writing does not require performance and the few bits that do you can write a C extension or simply use C++ and python together
@@RicardoSuarezdelValle I kind strongly disagree. Did you ever experienced slow UI, stuttering App, lagging game, ... If yes, you already met a programmer who said "most code writing does not require performance". If you said a code does not require performance that just mean you consider your time more valuable than the user time. As a developper we don't own time, the time is not ours, it's the user time. That's what make the difference between a smooth app, slow and memory heavy software, like everything web based, slack, etc. And all chromium stuff. Most of the devs said It's just a chat app, I don't need C++, just a chromium based. Consequences... My Mac/PC uses 8 GiB for doing nothing, just running a VM. And in a industrial point of view, you can release your startup with python code and saying "how I don't care it's CUDA underthehood". You just expose yourself to have a competitor who implement his stuff on C++/CUDA directly and this competitor will explode his profitability because his AWS bill will be much cheaper. We always require memory efficient and fast code. If none of those argument convience you, consider the CO2 argument, it's more eco-friendly for you PC or your server or your N-instances of your programmer running on AWS. I love python to prototype idea, and accelerate my exploration of ideas, but I cannot be serious with that to my clients. I know lot of "AI startup" are like that, download the model from the researcher, create a docker, build a website => step 2 => profit. Most of them rely on Python, but any competitor with cheaper infrastructure can scale more and be more efficient. I had in mind Facebook developed on PHP fine, cool, but at the beginning each new user cost more than the previous one, ... FB wasn't able to scale. They create "HipHop" compiler from PHP to C++, and now the company became profitable each new user became cheaper than the previous one. Conclusion => Performance always mater. Don't read me wrong, that doesn't mean I over-engineer everything to save 1 byte or 1 pico second in median. But keep in mind the quote "early optimization is the root of evil" was written from a time when everybody was written C and assembly code... The code is different, today with python, javascript, ... "early non-optimization is the root of evil".
10:48 I think it would be nice to define on the left that lower is better (I've usually seen it done in benchmarks). Thank you for the video! About CPP, I think you might've used SIMD instructions.
Good point, I def could have made the metrics interpretation clearer. As for SIMD, it's hard to parallelize this because it's an inherently serial problem (everything requires previous solutions)
@@dougmercer Haskell is my love and I like lambda calculus so I am writing a interpreter and compiler for my own lc implementation for fun. (in haskell)
Im enjoying the video, serious question though. How can jit be faster than c++? Did you have the c++ optimizer on? Nevermind, found a comment where you said that you used -O3. Great work. I feel like anyone who complains about your c++ isn't being fair. While i may have done it another way, its valid
Probably means that I left some performance on the table in the C++, or the JIT pulled some tricks that most people wouldn't pull when writing it natively. Someone else in the comments found that using a flat 1D array gave the C++ a 1.1-1.2x speedup. That probably puts it on par with the Numba/Taichi ndarray approaches That said, the point of the video still stands-- for at least this particular problem, there are several approaches for getting performance on par with native C++
Great video! From what I've tested, your C++ code is good enough. The main bottleneck of your code seems to be the dp result variable. I was able to double the speed (from 3.78832 to 1.77546 seconds) by replacing dp 2D array by two 1D arrays: one "current row" array and "previous row" array, and swapping references around at each iteration. This probably because the code don't have as many cache misses by not fetching new rows of the "dp" array, which are filled by zeros anyway. I did not test this with the Python code, but the same speedup should be obtainable by using two variable (or an tuple of 2 arrays) to keep up with C++.
I am writting a paper about speed in coding, both when coding and executing, and Im comparing C, C++, Rust, GoLang and Python, Python takes the crown in speed to write the program always (Im not taking just bare time it took me, but also amount of characters the program has and complexity of sintax is taken into account), but C it's just perfect when what you need is performance, and in the end, python is just another language based on C. Are world is C, it has always been.
10 หลายเดือนก่อน +1
Very usefull. A quick question, what eas the optimization level for compiling the c++ code. It can really make a diferrence.
I used -O3. Another commenter recommended using a 1D array and handling indexing through arithmetic, and that does speed up the C++ by about 1.1-1.2x. (still pretty similar to the ndarray approach from Taichi) Here's the c++ code and build script if you want to play around with it yourself =] gist.github.com/dougmercer/1a0fab15abf45d836c2290b98e6c1cd3
@@dougmercer thanks! i just managed to get it 169% faster (see fork). still, the speed improvements offered by numba, pyx, and taichi are really impressive :)
Very cool! Yesterday I implemented the 1D index approach (not nearly as cleverly-- just hand jammed the indexing arithmetic in line) and I got about 1.1-1.2x speed up. Does the noexcept make a difference in performance? Or is there something else causing the extra 0.4ish speed up 🤔
Hmm. In past, for bioinformatics, I've used plain Python and made heavy use of bytestrings (instead of Unicode) and memoryviews (to avoid creating copies of the data). You could potentially try PyPy. My latest video used PyPy from file I/O and string manipulation... I suspect it'd be a good choice for bioinformatics. If limited to choices in this video, I suppose I'd pick Cython
Wonderful video, even for a beginner like myself! I wonder if you could share the animation tool you used? I feel it would be awesome for my presentations :))
I do wonder if using normal C++ arrays rather than std::vector would have made the C++ the approach faster. Also, I think it could be faster if dp would also be passed by reference just as a and b are.
Thank you Doug for this awesome video! Btw, just curious: has anyone tried some of this on Pygame? I know Python it's not a common language in the videogame industry, but maybe some of this could bring it some justice (and good surprises).
You can definitely use Cython or Numba to help speed some things up with pygame. I found a few old reddit threads that included demos and discussions by searching "Numba pygame reddit".
I feel like you could get similar performance using lupa + lua_importer or nimporter/nython. Both lua and nim are similar in difficulty to python, though I think nim is somewhat like rust when it comes to how to code it.
Another option - learn Nim. It is an easy to learn language with a pythonic syntax. Because Nim is a compiled language, it's speed is on par with C, C++ and Rust.
Your "faster than C++" claims cannot be taken seriously if you don't show your C++ code, and don't invest the same effort in optimizing that C++ code. EDIT: apologies I missed the C++ slide at 2:22; as expected, this is not a very well written C++ code. Leaving aside the 1-base indexing that wouldn't affect performance, nobody who's writing computation intensive code would allocate a 2d array with vector; the standard way is to use vector with manually computed indexes. Your fastest "python" version is parallelized and uses a globally allocated dp array -- did you try doing the same thing in C++? I say "python" in quotation marks because these tools aren't quite python anymore -- these are dialects that heavily rely on static typing to get their performance claims; while it's the dynamic nature of python that is both it's strength (expressiveness) and weakness (performance). Other things that a performance-oriented C++ dev would do that you may be surprised that they make a difference: replace 'if' with a ternary operator (easier for compilers to optimize), use restrict pointers to access the data in the arrays (helps automatic vectorization), allocate arrays with new/malloc (avoid unnecessary initialization of the arrays). Other things you'd need to disclose: what system and compiler flags you use for testing the C++ code; how did you measure the performance of either of these; what did you do wrt the python's GC. I'd also point out that this case-study analyzes a single function in isolation, which makes it a not very good representative of real-world applications.
@@dougmercer Yes, I did notice that after I posted my comment. The fact is still that that's a very ill-written C++ code which you didn't invest any effort in optimizing. The only valid comparison here is of badly written C++ against idiomatic Python which shows that the C++ is nonetheless 100x faster. Great...
@ybungalobill Addressing your edit-- > "fastest" version was parallelized None of the Python code was parallelized, because the problem is inherently not very parallelizable (each DP entry relies on previous solutions. At best, you could parallelize the "wavefront" of the DP, but I did not do that). I mentioned that taichi *would try to parallelize it* if I didn't explicitly tell it not to. So no parallelization here! > Static DP array It is true that allowing the revised version of the Taichi approach to use a globally allocated DP array is a bit unfair. The other approaches (including C++) would also have benefited from that. However, several of the approaches were faster than my C++ without that optimization, so let's focus on those and ignore the statically allocated Taichi approach. > C++ isn't optimized. You are definitely right that further optimizations could have been made to the C++. At the end of the video, I even admitted it! So, thanks for recommending a few things that could be improved. As an admittedly *terrible* C++ programmer, I don't know what a "restrict pointer" is. I did mess around with a ternary if/else and didn't see a performance difference on my machine. I did not mess around with 1D indexing, because I wanted my implementation to match my Python semi-closely. I explicitly wanted to compare what a Python programmer would write if they were trying to re-write their code in C++, rather than creating some nightmarish SIMD optimized, wavefront parallelized, hand written code to most efficiently solve the LCS problem. That's simply not the code that my audience would write in the circumstances where they would reach for these JIT/AOT-powered tools. If you wanted to write a highly optimized C++ implementation, I'd be happy to test it and include the results in a pinned comment. > These aren't Python anymore. I agree in theory, but disagree in practice. Here's my take-- if I can `pip` install a package, use that package using Python syntax, and easily interoperate with my broader Python project, it's Python-enough for me. Numpy isn't "python", but I consider it "python"-enough. Importing and using numba's `jit` decorator or writing `taichi` is far less burdensome than, say, hand writing a wrapper function in C++ to expose using `ctypes` (which I have done in the past, and hated every minute of). All that said, if you still want to keep your downvote on my video, feel free! Sorry you feel that way, but I guess my video wasn't a good fit for you. In any case, thanks for your feedback!
@ybungalobill "badly written" is one entirely valid way of describing it (and I even admit that-- calling my implementation "naive" here th-cam.com/video/umLZphwA-dw/w-d-xo.html) , but "the C++ code that a Python programmer with little C++ experience would write" is another. As a channel predominantly focused on Python content, my intention was to make a quick, but honest attempt at reimplementing my Python solution in C++, and compare its performance against what I could achieve with JIT/AOT Python libraries. For some reason I couldn't @ you in my other reply, but in my other reply I made a few minutes ago I try to address some of your other feedback.
@@dougmercer Thanks for your comments; looks like we mostly agree here; the difference seems to be that you've limited yourself to a Python dev perspective while I'm looking at the broader picture. While I understand the reasoning behind doing a 1:1 translation of python to C++, it reinforces the incorrect mindset that the difference in languages is purely syntactic. You see, I got a link to this video from a coworker as evidence that "python can be faster than C++, you just need to do these X Y and Z and then magic happens". And at a glance that's what this video communicates. If that's not your goal, you may want to adjust something in your presentation.
So, I'm not a big python guy so I was curious. I repeated your experiment for C++ vs numba. Only real difference: for the C++, I rewrote it just a bit (used auto and changed the indexing a bit to be more c-like) and I wrote the function as a template in which the size m and n were the template variables. This allowed me to change from a vector to a stack allocated array, the main benefit I believe being that the whole memory is contiguous and allowed for better caching. The C++ version was about 1.5x faster than numba on my machine. I really enjoyed this video though! Made my question my biases, and I think there's alot to be said by letting compilers/optimizers do the thinking for you. I think this was really insightful and I think I'm gonna give the numba one a go for many of my future quick projects.
Oh, that's awesome! I think that's the fastest anyone has gotten it so far! Someone else in the comments encouraged me to try a 1D vector of size (m+1)(n+1) and index into it with arithmetic -- that gave me a roughly 1.1-1.2ish x speedup over the original C++ . So, I guess much of the remaining speedup came from data locality-- very cool that it was another 0.3x-ish boost. I'm glad you found the video interesting =]
Really interresting Video. I‘d love to learn more about it. Maybe I will be laughed at for this statement, but even with this video i feel like bringing python to C-Level performance seems to be quite a bit of an effort. Isnt it worth it to learn C/C++ for special tasks? How would you evaluate the developer‘s expirience comparing „Make everything possible with Python“ with „Learning C/C++ or Rust“? Thanks a Lot!
You're right! It's not easy to get C++ performance in Python. I think these tools are appropriate when there are a few "hot spots" in your code, but the majority of your application benefits from Python's ecosystem. It's possible to directly build C extensions and call them from python, but I think these tools are way easier. For some (new) projects, it might make sense to write the whole thing in Rust from the start. In practice, most of my projects use a lot of Python libraries, and my team is not very flexible (they mostly only know Python), so it'd be pretty disruptive if I wrote a critical component in a different language and with different tooling. Good question! (Sorry I don't have a good answer =P)
@@dougmercer that is indeed a good answer, thanks. Since I am working in the Data analysis field (geospatial) I love Python for its possibilities. I was wondering if it makes sense to learn another language for intensive calculations like C++. But think I will try your tools 😊 Many thanks!
for anyone interested, I copyed his c++ code and his example with 30000 elements in each vector, and in my computer it ran in ~25 seconds (my PC is slow). By simply compiling with -Ofast, it got down to ~5 seconds, still without modifying the code at all. I'm not hating in the content of the video, wich in fact is great
@@dougmercer yeah I tought of that, but as (I think, it's been a while since I watched your video, and I really can't check now) you didn't mention anything, I assumed not. Well this makes it more impressive then for python, nice!
You're right, I didn't mention anything. Next time I'll be sure to show these sort of details in video. Was your 25s measurement with no optimization? I wonder what the difference between -O3 and -OFast are for a problem like this. (I'm admittedly not a C++ guy, so OFast is new to me!)
@@dougmercer yeah my 25s was with no optimization, and I guess -O3 and -Ofast shouldn't really make the difference here. As far as I know, -Ofast's main difference from -O3 is that it can change the order of float operations, wich is kinda illegal since a + (b + c) != (a + b) + c for floats, but this code in particular does not have any float calculations so it should be basically the same speed
Hah, *very carefully* in Davinci Resolve (Fusion Page) =P I manually drew the graph using rectangles, then applied (noise + displace) to make it more irregular + (fade it out with noise + the "painterly" effect from Krokodove) to give it the water color appearance + paper texture + adding lens blur One of my favorite animations I've made =]. Thanks for commenting on it
=] I also used Nosferatu in my other video called "Your code is almost entirely untested"... I wonder what it means that I keep putting horror movie clips into my Python explainers 🤔
Hey, thanks for an amazing video! Which one would you suggest so that I can just grab my regular python code with dataclasses and get a performance boost with no tweaks whatsoever?
@@dougmercer Thanks for your answer! I was thinking of something. Nowadays we almost always use type hints because they are great. But only for clarity/type-checkers like mypy. So we are not getting any performance benefit out of it, although I think we could have! Cython translates python to C and forces us to write statically-typed python for that. Which type hints could also be used for... Turns out that Cython supports type hints as well! Then we have stuff like MonkeyType that allows us to automatically type-hint code based on runtime behavior. Nice for annotating legacy code. 1) we write python code with type hints 2) if needed apply MonkeyType to apply them everywhere 3) compile with Cython 4) get a C-like performance I wonder why it's not actually practiced. Do you have any idea?
Mmm, for using type hints to achieve better performance through compilation, I think there's a high level design question: "should your code (1) look/feel like vanilla Python, or (2) are you OK with using non-standard Python features, or (3) are you willing to use syntax that only works in your special language, as long as it still vaguely resembles Python and interoperates with it"? I think mypyc is the closest to achieving the goal of speeding up vanilla Python. cython's python mode is pretty OK, but you need to add extra metadata to make it be performant (e.g., the locals decorator). Cython also has its own type system rather than using Pythons built-in types (e.g., cython.int vs int). Cython as a language (in non-python mode) isn't really Python any more, but interpolates with it well. Some other languages (e.g., Mojo) claim to have a "python-like" syntax and support interacting with Python, but the code isn't really Python.
@@dougmercer Yeah it would be amazing if we could just write vanilla python with standard type hints and compile it with Cython. Apparenly Cython somewhat supports it. TH-cam blocks my commend if I paste a link but you can search this on google: Can Cython use Python type hints? Because todays type hints are everywhere and we don't get any performance benefit out of it at all, which feels weird.
It's hard to say-- when I was experimenting with this problem I remember not observing any speed up when adding vanilla Python typehints, and it wasn't until I started adding things like the @locals decorator that I really noticed any improvement. Let me know if you do any testing that shows a meaningful speed up!
Hmm, I guess the only way would be to write efficient code. I'd profile the code to see what functions are taking the most time, and then focus on improving the slow/frequently called ones Use the right data structures/algorithms. consider using functools.cache to memoize anything that would benefit from caching. reprofile your code after each change to quantify what changes were helpful. You can technically write your own c extensions if your system has a c compiler, but that's probably not what you want.
Thanks se se! =] I think mojo is very cool. That said, from what I know, I believe their license was restrictive for commercial use? Maybe I'll eventually do a follow-up video on it and the other proprietary Python superset that Im failing to recall the name of if this video does well. I also skipped over PyPy, for the sole reason that it failed to install/run on my laptop. ¯\_(ツ)_/¯
if python++ was that good, cython would already be the big thing. I feel like this approach suffers from both worlds: it's harder to understand how a program works compare to python, so nobody uses it instead of python, and it's harder to optimize than c++, so nobody uses it instead of c++.
I've never tried it! Does it work well? I'll have to mess with it sometime 🤔 That said, I am working on a video where I cover one library that I wanted to include in this video (PyPy).
Due to the heavy "readability" over efficiency mentality in the online Python community, which I personally just ignore, many examples of Python code use excessive memory and have redundant and/or inefficient logic. An infamous example is ArcGIS Pro using Python heavily but seems to be written by someone who knew C/C++ and tried to use conventions from C/C++ in Python resulting in hundreds of redundant functions that could be reduced to just a handful. I've also seen this following coding template before too many times: var1 = "something" var2 ="var1 Why?! I wish I was joking. Can someone explain to me why on Earth in any language and/or circumstance you'd want to do this?
Great video ! But why you did not put "ti.init(arch=ti.cuda)" instead of "ti.init(arch=ti.cpu) ????? This way Taichi will use the GPU instead of the CPU and the result is even faster. For instance a for in range(0, 1000000000) loop gives .05123 second instead of 2.4 sec. with C++ on a Nvidia RTX 4060. With "ti.init(arch=ti.cpu) you get "only" 0.081 sec. So 1.59 times longer than on my GPU.
For this problem, the algorithm isn't parallelizable, so using GPU would slow it down. For the fractal visualization I did early in the taichu section, the GPU makes a world of difference!
@@dougmercer Thanks. Taichi certainly looks promising, but I still prefer Numba for its simplicity, i.e. adding a couple of decorators, without altering the code too much. Have you tried Spark and Dask ? They're both parallel programming libraries.
Yup, both are great! Since this problem couldn't be easily parallelized, I didn't mention them. And I agree, in general Numba will be easier than Taichi by a long shot. I just thought Taichi was kind of neat so I included it in the video ¯\_(ツ)_/¯
Yup! gist.github.com/dougmercer/1a0fab15abf45d836c2290b98e6c1cd3 Some people in comments have gotten between 1.1-1.7x speed up through other improvements, but it doesn't really change the narrative much: these compiled Python tools frequently give good enough performance
@@dougmercer thank you! I think is really problem dependent. In some codebases I worked I had for example a 40x speed up over cython or numba by embedding very very small pure C functions using ctypes.
Yes, in my case there were two issues, the first was that cython for some things relies on the python interpreter if data and objects are not managed in the most cythonic way, the second was cache misses. I was working on a kd-tree implementation and a tiny detail on how nodes are managed let me cut out on cache misses during tree traversal. For that purpose I used perf to sample from the process but I know for sure that there are many other options for doing that.
Hey man, very interesting content. Some unsolicited advice that is meant to help and not be mean, but in my opinion all the stock video you use to try and describe every single sentence is a bit distracting and doesn’t add value and background music is a bit loud/unnecessary. Very interesting content though 👍
Hey, thanks for the really polite and sincere feedback! I agree with both points. In more recent videos, I've used less and less stock footage, and I think I've gotten a bit better at mastering my audio to keep my voice easier to hear. Hopefully I keep getting better at this moving forward. Cheers!
But did you put as much effort in your CPP implementation as your python implementation? I love python as much as the next guy and I know a lot of python peeps don't want to write CPP but, at some point you gotta really wonder, "should I just learn CPP?"
You should not use vector of vectors in c++ First of all you will allocate memory m+1 times (for each of inner vector). This is slow. Also this data layout is not cache friendly because each vector will be allocated on its own and whole table is scattered around. What you really should do is define one big (m+1)*(n+1) vector and use this contiguous space as if it has two dimensions like this v[i*m + j] So you skip i rows then select j column. I bet you can easily beat python with this simple modification. Also be sure to compile it with at least -O2 optimization in release configuration so no debug stuff will slow you down at runtime
Another commenter actually already tried a single contiguous vector. They found that -O3 optimizes away any difference in performance. Here's the comment thread where they talked about their attempts th-cam.com/video/umLZphwA-dw/w-d-xo.html&lc=UgyNE2s94tUKjG3hayF4AaABAg.9rJ4vi7-9UyA-ES8Dn0d1t (needs to be opened on desktop) Here's a gist to the implementation and compile command used in the video gist.github.com/dougmercer/1a0fab15abf45d836c2290b98e6c1cd3 So, feel free to let me know if you get a significantly faster -O3 optimized version. If you do, I'll pin your comment.
If you will experiment try to change i and j in v[i*m + j] By changing it you will change memory traversal order from row major to column major. This will change cache misses ratio and resulting speed. You can google for cache friendly data layout to learn more. Those things are very important if you want speed!
Building libraries Length 30000 Sequences, 10 Reps Time using lcs_taichi_ndarray: 23.286619998994865s Time using lcs_taichi_field_once (this one is cheating): 18.95515070901456s Time using original C++: 26.1285s (n_rep=10) Time using O(n) memory C++: 26.7876s (n_rep=10) Time using flattened 2D into 1d C++: 22.3163s (n_rep=10) so, a 1.10-1.20ish speed up, but not enough to meaningfully change the analysis.
@@dougmercer totally might be true because compilers are very smart with optimizations now days. Table is a local variable so compiler is allowed to do basically anything( as long ad observed behavior is not changed). Also difference might be invisible if whole table can fit in cache. I dont remember size you tested. Anyway it is good to know that you already familiar with this little details. If everything is done properly it is really interesting why c++ looses some speed. I should look at your video more carefully…
I had two input arrays of length 30,000. so that would induce a 30,000 by 30,000 matrix. So, kind of big? That said, the 1D indexing did close the gap between the taichi ndarray approach and the C++. So, I don't think it lost any significant speed to taichi. Reason being, the the taichi approach that allocates the field once is unfair (insofar as other approaches could have also made that optimization, but I didn't implement them).
If you're new here, be sure to subscribe! More Python videos coming soon =]
You're very underrated
Mojo and Codon(Exaloop)?
Did so for just the anima-graphics alone. What an IT gigachad you are.
this example is too simple: dynamic types break compiler's optimization, but Your example easily converts to static types ;D
Numba and cython are an easy way to improve performance beyond what most people require for python, and they don't require much boilerplate either.
Absolutely!
@@dougmercer taichi doesn't look very boiler-platy either with just the use of a decorator.
Great video, I enjoyed it! In my eyes the video actually shows how fast C++ is. Unoptimized line by line translation from Python to C++ can be as fast as compiled Python optimized with HPC library.
Absolutely. C/C++ and gcc -O3 is basically magic.
@BartekLeon-jx5jv it's not a 1D array, but a homogeneous ND array. It's somewhere between vector and int[A][B]. It is represented as a flat array in memory, but unlike int[A][B], the data type, number of dimensions, sizes of these dimensions and the iteration strides are dynamic. Also, it's not just taichi that's using ndarrays, numpy and numba are also using ndarrays here.
This channel is highly underrated. Excellent analysis.
Thanks for the support Maks! =]
damn, this is a high effort channel. your stock footage game is especially on point.
hope you pop off big time :)
That's so nice! thanks =] 🤞
I've also had some fun using various methods to speed python up, and this video is a great overview of the major ways of going about it, but while it's a big departure, I've found nim to have the most python-like syntax while being as fast as things get (compiles to c, among many other languages). I've seen that you know about the true power of python already, but James Powell did a great talk about this exact topic titled "Objectionable Content", big recommend. Thanks for the video!
I'll check it out!
Also, I have looked at Nim in the past. It seems nice.
Eventually I may do another video on this topic, and branch out to other languages (Nim, Julia, and now Mojo).
Thanks for the idea, the video rec, and thoughtful comment =]
Nim is great with easy interop with Python!
pypy is a jit for full python with special bindings for numpy and scipy. you can use it for any python code, but for max performance might need to write critical parts of your code in rpython, a subset of python that can be statically compiled to native binary. The example subsequence code is valid rpython btw.
PyPy is fantastic -- I'm actually going to cover it in my next video!
Thank you for making this. Trying out mypyc, cython, and numba right now! :D
Enjoy! And good luck =]
Nice video. I have just learned cython and achieved a speed up of 500x vs pure python(+numpy) in one of my code. It worth to mention that using cython, you can automatically parallyze your loop with prange statement instead of range.
500x is great!
And good point on prange-- I should have covered the parallel aspect more of all the solutions (numba, Taichi, and cython) but I glossed over it due to the serial nature of the example problem.
Thanks for the comment =]
Subbed, nicely done. I can tell you were having fun, IMO don’t worry so much about the glitzy graphics - your story telling is great!
Thanks so much =]
Love this video so much! The quality of content, animation, and visualization is unmatched...
Thank you so much!
After reading the other comments while thinking up my own, I feel compelled to echo this sentiment first.
Fantastic job, @dougmercer - both technically and visually - I loved it all.
Thanks @stereoplegic! That means a lot =]
Super nice video my man. I've watched it a few times, I thought you would have several hundred thousand subs when I looked at your channel. Great, great content.
Thanks so much =] that means a lot!
Hopefully I'll hit 100k subs in another year or two 🤞
I was already aware of numba, but it's good to see all the others like this. Enjoyable video, and I was happy you showed most of the code, while somehow making it feel like a documentary
That's an awesome compliment-- I'm gonna put "Code Documentarian" on my resume.
Thanks for watching and commenting =]
Dude, the quality and depth of this video is insane. I feel like I have a deeper understanding of the strengths and limitations of python, and I have been using it for about 7 years. Thank you
Glad it was helpful =]
Easily the best video I have seen on performance Python, subbed.
Thanks so much! I should have another performance related video out in mid April so see ya then =]
Great video man. I'm going to try this on my web server project that uses numpy quite a lot.
Numba should work great! You may just need to tweak your implementation slightly to use the subset of numpy features supported by Numba.
@@dougmercer have you tried anything that helps with matplotlib?
Hmm. Hard to say.
Could try mypyc-- maybe it'll just magically work.
Alternatively, though this might be a bit disruptive, you could swap out CPython with PyPy (a JIT compiled replacement for the CPython interpreter). In the video I'm working on now, PyPy was shockingly convenient and fast.
What are you plotting, out of curiosity?
Maybe do a quick sanity check to make sure the amount of data your plotting has exceeded the usefulness of matplotlib.
If it's a scatter plot with millions of points, maybe you should use something like datashader or similar
@@dougmercer I'm using it to do histograms for images that have been turned black and white and then converted to 8 bit png files to convert them to stippling.
Brilliant video and useful content. It's a pity there's so few of us... Glad the algorithm suggested this video
Thanks! Glad you found it helpful =]
If it ran faster than your c++ code, there is a problem with your c++ code. It's basically impossible to run faster
Tell that to C and Zig
@@FranLegonthey mean python cannot run faster than C++ not that nothing can
You're right
@@FranLegonIf you’re comparing C/Cpp/Rust/Zig and saying they’re different because of a benchmark you saw you’re j ignorant. they all compile to the LLVM IR nowadays that has its own optimizations
Not really, that JIT compiler can generate better code than the C++ compiler, because of things like automatic vectorization. Obviously, you'd be able to write such code yourself in C++ (which can be quite painful, that's why using a python-like language is so interesting).
Text on the screen was definitely engaging ;) Thanks
Yay! Success =]
Gotta love mypyc, I've been using it in my project and never felt disappointed
First time watching a video from you and really loved the explanation and animation! Keep going 🔥
Thanks! Will do =] recently I've been working on making my own animation library so it's easier for me to keep making these videos. More to come soon!
Yeah.
Try use std::array. Should've match Taichi's init once result.
Also, what type of optimization flag you used to compile the C++? Thanks!
True, I should have spent a bit more time optimizing the C++ approach.
Here is the compile command I used
gist.github.com/dougmercer/1a0fab15abf45d836c2290b98e6c1cd3
@@dougmercer update: I've tried to use std::array for this case, but it only works for small number of n because I, an idiot, forgot that stack memory is severely limited to just some kilobytes🤣🤣🤣
on the other hand, we could use matrix libraries like Eigen which employs a better data structure for this case.
5k subs? I swear I thought you had like 1 million because of how good this video was I'm subscribing
Thanks =] I appreciate it.
It's been a slow grind, but the past few days the algorithm has blessed me with some impressions, so I hope it keeps going 🤞
Well, I use numba in my research, concerning the human genome, it was really fast!
That's awesome! I love numba-- super convenient and fast
Wow! Really informative and interesting - Thank You! I am now a subscriber 😊👍
Thanks so much =]
good content, great presentation. love the style!
Thanks Finnnicus! Much appreciated =]
This is a very high quality content, mate!
Well done!
A question, for gamedev use case, can we just use the tools mentioned to speedup things?
I've seen horrible performance when someone is using Python-based game engine (like pygame etc).
Thanks! =]
Yes, you should be accelerate a pygame-based game with these tools.
You can't speed up pygame functions and methods, but you can speed up your code between those calls. It'll be most well suited for larger, number crunchy parts between methods rather than quick little one-off operations.
Let me know if you end up tweaking something and seeing a boost in performance!
Great video! If I could make a suggestion though, the background music is a little too loud. It was hard to follow some of what you were saying because of it (like for the cython stuff for example). Overall awesome video though and learned a lot!
Thanks! And def agree- I've reduced music levels in later videos. Thanks again for the feedback and comment =]
Very cool video! Did you consider compilation time in C++ tests? I used Numba daily, and the first run is always slow due to the JIT feature.
I did not count compilation time for the c++ times, but did include JIT time for the first run of Numba. However, it doesn't play a big impact, because we are typically doing 100s or thousands of runs and adding up their times (so the first run being slow only accounts for a small part of the overall time)
limited branch c code will usually be faster in most applications , but if you want code to be ridiculously fast use assembly. inline assembly is cool too works directly with c. however speed comes at the cost of convenience often
I learned C++ in the last month (came from a Python background!) and tried my luck at coding real-time animations of fractals. I wanted to compare with Python's performance, but now I am scared I learned C++ for nothing... Thanks! (Just kidding I loved learning C++ and I am glad I did. It's super impressive however to see that we can achieve similar performances with these packages in Python! Thanks for the video).
Taichi is great for fractals! I like that it has good built in infrastructure for plotting to a canvas.
That said, I'm sure you'll find a use for your new-found C++ knowledge =]
My favourite was numba as we were able to achieve our goal with very little code, there are certain shortcut algorithms that can be applied to makeup for its non applicable functions
This channel is a hidden gem
Thanks 💎 =]
That was exceptional. Thank you very much.
Thanks for watching and commenting!
Just excellent in every way. Subbed.
=]
@Dough: Your channel's popularity should be atleast 100x more!!!
Thanks so much! Fingers crossed the channel does grow 100x 🤞. At that point I prob could make videos full time 🤯
Super enjoyable video, thank you this was very helpful!
Thanks! Glad it was helpful!
That's nice of you to point these libraries out.
Thanks!
Thanks for the analysis, I got motivated to look at numba and cython more carefully.
Taichi looked cool, but not having it in the anaconda repo is a negative point for me.
Have you tried running this code with TORCH?
Oh interesting, I didn't realize taichi wasn't on conda-forge. I wonder if they'd accept a PR 🤔. For what it's worth, you can pip install it (and that's possible even if you're using an environment.yml).
I did not try torch, but I suspect it would very slow. Reason being-- the main use case for torch is parallel computing via tensors. Since this problem is inherently not parallelizable, my guess is it'd be super slow in torch.
@@dougmercer Thx for insinghts
in therms of lists, A tuple is way faster as a ordinary list or even a dictionary. you can speed things up if you load things in a tuple instead of a list if you don't have to modify it. Also lock at your bytecode you can already see if something is unnecessary. And dont use Dict unless you have to.
That are some of the things to improve runtime. Hey even the order of your functions have an impact.
I'm learning Zig and decided to practice this benchmark and see how fast it could go. If we use the same optimization as the last taichi variant (pre-allocating all necessary memory), it takes 1150ms. If we leave allocation inside (creating and zeroing the matrix, cleaning the memory after we got the result) its about 1800ms (i7-14700k, Ubuntu).
I wanna learn Zig this year...
Love this video ! it was amzing and usefull !
Thanks so much!
Beautifully done!
Thanks Russell =]
There are also ways to write c++ directly in python i think, for instance cppyy or with torch extension
True! Through C/C++ extension libraries, you can directly write/link C/C++ libraries and write your own Python interface to it. Cppyy, ctypes, cffi, pybind11, and Cython are all fair game for this.
If you don't keep your content consistently uploaded, you'd be committing a felony.
Subbed!!
I'm gonna try! Hahaha
Thanks for subbing =]
Would be interested to see what Pypy and nuitka do for it as well.
If this video ends up getting some more views, maybe I'll do another pass at adding other options.
I have a *guess* though...
PyPy would speed this up significantly, probably on par with numba. I've heard good things about it *but* it didn't install first try when using conda on my M1 Mac, so I skipped it ¯\_(ツ)_/¯
Nuitka would only speed things up a little bit. From what I've read, nuitka is more so about compatibility (supports *all* python language constructs) and for making standalone, portable builds. For nuitka, speed is secondary to those concerns
Great video! Super underrated channel. Love the graphics
Thanks Roshan! Means a ton to hear that =]
The theme, info, ambience and the whole vibe of the video is so good. Subscribed !
That's like the best compliment =] thanks!
crazy good video! I'm gonna check out Taichi for sure
Thanks =]
That was funny, I did both C++ and Python but now I'm more on C++ side. I had in mind the meme "look what they need to mimic a fraction of our power", I didn't tested it, but I bet If you change the proper compilation options that will be faster again in C++.
To my understanding this is what taichi do, it's general SIMD based on your current hardware, under the hood via LLVM optimizer based on the data structure (taichi is tailored for sparse data structure).
As you work with dense data Halide would give you [maybe] better results.
For all cases the code generated by python front end can be generated by C++, the python will always have an overhead. This is what Machine Learning people do, they don't care about python performances, because all the computation which too 90% of their frame is implemented on CUDA and C++, the python is here only to provide data to lower level system.
> "look what they need to mimic a fraction of our power"
Haha, true! In another comment, I said I loved that even if I write terrible C++ it still turns out pretty fast.
That said, the same argument could be reversed, if we consider productivity and third party library access.
If an application is 95% high level glue and one hot spot, I'd rather write the majority in Python and the hot spot in an AOT or JIT compiled variant of Python than write my entire app in a low level language. The overhead would be worthwhile from a productivity perspective.
> Proper compilation flags
Do you have flags you want me to try in particular? I did -std=c++11 -O3, but maybe I'm missing something.
> SIMD
Since this is all sequential, can SIMD help? I thought SIMD was for packing multiple of the same operations in a single instruction (but again, I'm not a C++ dev)
> the Python just provides an interface to a lower level language.
True! And I'm OK with that!
I def agree that well written, native code in a lower level will out-perform generated code from Python.
That said, for all but the most trivial algorithms, I can't write well-written C++. So, if I can get even a 95% solution for free from these high level LLVM interfaces, then I'm stoked!
@@dougmercer ( :
That remind me a benchmark done by Microsoft, Debug C++ /NoSIMD vs Release C# SIMD, and they notice faster C# :D Yeah sure... The point of Python is not to be faster, it's mostly to be gentle with non-engineer-long-beard programmer, the user are mostly scientist and data-analysts.
> Productivity
For this example I see no productivity differences between C++ and Python.
But personally I'm more productive in C++ with Eigen and few other lib
Like an experimented Python will be faster with numpy and his other favorite libs.
> Proper compilation flags
I don't know what is your compiler, but for Visual Studio:
/Ot {favorize speed}
/Oi {Inable Intrinsic}
To increase the STL speed, Disable C++ expcetion, "Basic Runtime Checks", /GS-, /GR- ...
To help intrinsic generation /Zp8 or /Zp16 (here you're processing int), but we can process
And based on your hardware /arch:AVX, ...
> SIMD
You have gather and scatter instruction that could help, need to profile ( :
> Improve
On both side I'll bet we can performance by using only type you need. If your number cannot go higher than 100 just use a byte/uint8_t, etc.
As I said the video was funny, the point is not to say Python is faster than C++, but more "if you're careful you can have performance higher or close to baseline C++"
I'm using g++, I'll try to find the analogs for the compiler flags you recommended.
And true, a uint8 is enough. I'll mess around with that too.
In any case, thanks for the comments! I'd def like to learn more about C++ but I don't get the opportunity very often
@@chkone007 Ok, I get the point but theres a lot of production code written in python, most code writing does not require performance and the few bits that do you can write a C extension or simply use C++ and python together
@@RicardoSuarezdelValle I kind strongly disagree. Did you ever experienced slow UI, stuttering App, lagging game, ... If yes, you already met a programmer who said "most code writing does not require performance". If you said a code does not require performance that just mean you consider your time more valuable than the user time.
As a developper we don't own time, the time is not ours, it's the user time. That's what make the difference between a smooth app, slow and memory heavy software, like everything web based, slack, etc. And all chromium stuff. Most of the devs said It's just a chat app, I don't need C++, just a chromium based. Consequences... My Mac/PC uses 8 GiB for doing nothing, just running a VM.
And in a industrial point of view, you can release your startup with python code and saying "how I don't care it's CUDA underthehood". You just expose yourself to have a competitor who implement his stuff on C++/CUDA directly and this competitor will explode his profitability because his AWS bill will be much cheaper.
We always require memory efficient and fast code. If none of those argument convience you, consider the CO2 argument, it's more eco-friendly for you PC or your server or your N-instances of your programmer running on AWS.
I love python to prototype idea, and accelerate my exploration of ideas, but I cannot be serious with that to my clients. I know lot of "AI startup" are like that, download the model from the researcher, create a docker, build a website => step 2 => profit. Most of them rely on Python, but any competitor with cheaper infrastructure can scale more and be more efficient.
I had in mind Facebook developed on PHP fine, cool, but at the beginning each new user cost more than the previous one, ... FB wasn't able to scale. They create "HipHop" compiler from PHP to C++, and now the company became profitable each new user became cheaper than the previous one.
Conclusion => Performance always mater.
Don't read me wrong, that doesn't mean I over-engineer everything to save 1 byte or 1 pico second in median. But keep in mind the quote "early optimization is the root of evil" was written from a time when everybody was written C and assembly code... The code is different, today with python, javascript, ... "early non-optimization is the root of evil".
The video editing is top notch too!
Thanks =]
Damn, I love your channel from parsing 1 billion rows of data
Thanks =]
@@dougmercer I'll wait more videos, even if it will be a year, bc I find this video just bc of my recommendation)
@@SobTim-eu3xu a new video should be out before the end of the month!
It'll be about a new library I made and published to PyPI, called 'signified'
@@dougmercer oh, nice, congrats to you about library, and yay, new video ahead!)
Amazing content, engaging presentation and sadly, underrated channel. Subbed!
Thanks so much! Be sure to share with friends/coworkers you think might enjoy this, and hopefully the channel will grow over time 🤞
@@dougmercer keep up the good work, it sure will 🙌
Excellent video. 👏🏼 Subscribed.
Thanks Manuel! Glad to have you =]
10:48 I think it would be nice to define on the left that lower is better (I've usually seen it done in benchmarks). Thank you for the video! About CPP, I think you might've used SIMD instructions.
Good point, I def could have made the metrics interpretation clearer.
As for SIMD, it's hard to parallelize this because it's an inherently serial problem (everything requires previous solutions)
Amazing work, as someone who has to use python against my will, I enjoy your videos
Thanks =]. What's your preferred language if Python is against your will?
@@dougmercer Haskell is my love and I like lambda calculus so I am writing a interpreter and compiler for my own lc implementation for fun. (in haskell)
@@GyanUjjwal-m4u very cool. I haven't touched Haskell much, but I'm learning ocaml for fun recently and enjoying it
@@dougmercer glad to see you join the functional land.. enjoy!!
Im enjoying the video, serious question though. How can jit be faster than c++? Did you have the c++ optimizer on?
Nevermind, found a comment where you said that you used -O3. Great work.
I feel like anyone who complains about your c++ isn't being fair. While i may have done it another way, its valid
Probably means that I left some performance on the table in the C++, or the JIT pulled some tricks that most people wouldn't pull when writing it natively.
Someone else in the comments found that using a flat 1D array gave the C++ a 1.1-1.2x speedup. That probably puts it on par with the Numba/Taichi ndarray approaches
That said, the point of the video still stands-- for at least this particular problem, there are several approaches for getting performance on par with native C++
Absolutely! Fantastically well done. I'm really quiet impressed by what you did.
Thanks =]
Great video!
From what I've tested, your C++ code is good enough.
The main bottleneck of your code seems to be the dp result variable.
I was able to double the speed (from 3.78832 to 1.77546 seconds) by replacing dp 2D array by two 1D arrays: one "current row" array and "previous row" array, and swapping references around at each iteration.
This probably because the code don't have as many cache misses by not fetching new rows of the "dp" array, which are filled by zeros anyway.
I did not test this with the Python code, but the same speedup should be obtainable by using two variable (or an tuple of 2 arrays) to keep up with C++.
Good point! I may have to re-run this experiment at some point-- I wonder how Numba/cython would perform with that more memory efficient approach 🤔
Such a cool video, why did I only find out about this now?
¯\_(ツ)_/¯ welcome!
Thanks! This is a really great info!
Glad it was helpful!
Great video! Did you compile the C++ code with optimization flags?
Yup! You can check out the C++ code/compile command here, gist.github.com/dougmercer/1a0fab15abf45d836c2290b98e6c1cd3
Nice video! Well done
Thanks for watching!
thanks for the video!
I am writting a paper about speed in coding, both when coding and executing, and Im comparing C, C++, Rust, GoLang and Python, Python takes the crown in speed to write the program always (Im not taking just bare time it took me, but also amount of characters the program has and complexity of sintax is taken into account), but C it's just perfect when what you need is performance, and in the end, python is just another language based on C.
Are world is C, it has always been.
Very usefull. A quick question, what eas the optimization level for compiling the c++ code. It can really make a diferrence.
I used -O3. Another commenter recommended using a 1D array and handling indexing through arithmetic, and that does speed up the C++ by about 1.1-1.2x. (still pretty similar to the ndarray approach from Taichi)
Here's the c++ code and build script if you want to play around with it yourself =]
gist.github.com/dougmercer/1a0fab15abf45d836c2290b98e6c1cd3
4:00 where do you get this footage?
Storyblocks
with the c++ version, did you compile it with -O3 optimisations enabled?
Yup! gist.github.com/dougmercer/1a0fab15abf45d836c2290b98e6c1cd3
@@dougmercer thanks! i just managed to get it 169% faster (see fork). still, the speed improvements offered by numba, pyx, and taichi are really impressive :)
Very cool! Yesterday I implemented the 1D index approach (not nearly as cleverly-- just hand jammed the indexing arithmetic in line) and I got about 1.1-1.2x speed up.
Does the noexcept make a difference in performance? Or is there something else causing the extra 0.4ish speed up 🤔
Its nice... but which language would be more appropriate for handling strings... like in bioinformatics data?
Hmm. In past, for bioinformatics, I've used plain Python and made heavy use of bytestrings (instead of Unicode) and memoryviews (to avoid creating copies of the data).
You could potentially try PyPy. My latest video used PyPy from file I/O and string manipulation... I suspect it'd be a good choice for bioinformatics.
If limited to choices in this video, I suppose I'd pick Cython
Wonderful video, even for a beginner like myself! I wonder if you could share the animation tool you used? I feel it would be awesome for my presentations :))
Thanks!
I primarily used Davinci Resolve, but used the Python library `manim` (community edition) for the code animations.
Thanks! @@dougmercer
The ending of the video is what your programs see when you end a process
I do wonder if using normal C++ arrays rather than std::vector would have made the C++ the approach faster. Also, I think it could be faster if dp would also be passed by reference just as a and b are.
Thank you Doug for this awesome video!
Btw, just curious: has anyone tried some of this on Pygame?
I know Python it's not a common language in the videogame industry, but maybe some of this could bring it some justice (and good surprises).
You can definitely use Cython or Numba to help speed some things up with pygame.
I found a few old reddit threads that included demos and discussions by searching "Numba pygame reddit".
I feel like you could get similar performance using lupa + lua_importer or nimporter/nython. Both lua and nim are similar in difficulty to python, though I think nim is somewhat like rust when it comes to how to code it.
This is my first time hearing about either of those. Very interesting 🤔
Another option - learn Nim. It is an easy to learn language with a pythonic syntax. Because Nim is a compiled language, it's speed is on par with C, C++ and Rust.
I've been meaning to give it a shot... It definitely seems very approachable
Your "faster than C++" claims cannot be taken seriously if you don't show your C++ code, and don't invest the same effort in optimizing that C++ code.
EDIT: apologies I missed the C++ slide at 2:22; as expected, this is not a very well written C++ code. Leaving aside the 1-base indexing that wouldn't affect performance, nobody who's writing computation intensive code would allocate a 2d array with vector; the standard way is to use vector with manually computed indexes. Your fastest "python" version is parallelized and uses a globally allocated dp array -- did you try doing the same thing in C++? I say "python" in quotation marks because these tools aren't quite python anymore -- these are dialects that heavily rely on static typing to get their performance claims; while it's the dynamic nature of python that is both it's strength (expressiveness) and weakness (performance).
Other things that a performance-oriented C++ dev would do that you may be surprised that they make a difference: replace 'if' with a ternary operator (easier for compilers to optimize), use restrict pointers to access the data in the arrays (helps automatic vectorization), allocate arrays with new/malloc (avoid unnecessary initialization of the arrays).
Other things you'd need to disclose: what system and compiler flags you use for testing the C++ code; how did you measure the performance of either of these; what did you do wrt the python's GC.
I'd also point out that this case-study analyzes a single function in isolation, which makes it a not very good representative of real-world applications.
I did show the C++ code. th-cam.com/video/umLZphwA-dw/w-d-xo.html
@@dougmercer Yes, I did notice that after I posted my comment. The fact is still that that's a very ill-written C++ code which you didn't invest any effort in optimizing. The only valid comparison here is of badly written C++ against idiomatic Python which shows that the C++ is nonetheless 100x faster. Great...
@ybungalobill Addressing your edit--
> "fastest" version was parallelized
None of the Python code was parallelized, because the problem is inherently not very parallelizable (each DP entry relies on previous solutions. At best, you could parallelize the "wavefront" of the DP, but I did not do that). I mentioned that taichi *would try to parallelize it* if I didn't explicitly tell it not to. So no parallelization here!
> Static DP array
It is true that allowing the revised version of the Taichi approach to use a globally allocated DP array is a bit unfair. The other approaches (including C++) would also have benefited from that. However, several of the approaches were faster than my C++ without that optimization, so let's focus on those and ignore the statically allocated Taichi approach.
> C++ isn't optimized.
You are definitely right that further optimizations could have been made to the C++. At the end of the video, I even admitted it! So, thanks for recommending a few things that could be improved.
As an admittedly *terrible* C++ programmer, I don't know what a "restrict pointer" is. I did mess around with a ternary if/else and didn't see a performance difference on my machine. I did not mess around with 1D indexing, because I wanted my implementation to match my Python semi-closely. I explicitly wanted to compare what a Python programmer would write if they were trying to re-write their code in C++, rather than creating some nightmarish SIMD optimized, wavefront parallelized, hand written code to most efficiently solve the LCS problem. That's simply not the code that my audience would write in the circumstances where they would reach for these JIT/AOT-powered tools.
If you wanted to write a highly optimized C++ implementation, I'd be happy to test it and include the results in a pinned comment.
> These aren't Python anymore.
I agree in theory, but disagree in practice.
Here's my take-- if I can `pip` install a package, use that package using Python syntax, and easily interoperate with my broader Python project, it's Python-enough for me. Numpy isn't "python", but I consider it "python"-enough. Importing and using numba's `jit` decorator or writing `taichi` is far less burdensome than, say, hand writing a wrapper function in C++ to expose using `ctypes` (which I have done in the past, and hated every minute of).
All that said, if you still want to keep your downvote on my video, feel free! Sorry you feel that way, but I guess my video wasn't a good fit for you. In any case, thanks for your feedback!
@ybungalobill "badly written" is one entirely valid way of describing it (and I even admit that-- calling my implementation "naive" here th-cam.com/video/umLZphwA-dw/w-d-xo.html) , but "the C++ code that a Python programmer with little C++ experience would write" is another.
As a channel predominantly focused on Python content, my intention was to make a quick, but honest attempt at reimplementing my Python solution in C++, and compare its performance against what I could achieve with JIT/AOT Python libraries.
For some reason I couldn't @ you in my other reply, but in my other reply I made a few minutes ago I try to address some of your other feedback.
@@dougmercer Thanks for your comments; looks like we mostly agree here; the difference seems to be that you've limited yourself to a Python dev perspective while I'm looking at the broader picture.
While I understand the reasoning behind doing a 1:1 translation of python to C++, it reinforces the incorrect mindset that the difference in languages is purely syntactic.
You see, I got a link to this video from a coworker as evidence that "python can be faster than C++, you just need to do these X Y and Z and then magic happens". And at a glance that's what this video communicates. If that's not your goal, you may want to adjust something in your presentation.
Very nice video! I want to try mypyc
interesting and useful! Subscribed.
Thanks! And welcome =]
Interesting results!
Thanks! I was surprised too
So, I'm not a big python guy so I was curious. I repeated your experiment for C++ vs numba. Only real difference: for the C++, I rewrote it just a bit (used auto and changed the indexing a bit to be more c-like) and I wrote the function as a template in which the size m and n were the template variables. This allowed me to change from a vector to a stack allocated array, the main benefit I believe being that the whole memory is contiguous and allowed for better caching. The C++ version was about 1.5x faster than numba on my machine. I really enjoyed this video though! Made my question my biases, and I think there's alot to be said by letting compilers/optimizers do the thinking for you. I think this was really insightful and I think I'm gonna give the numba one a go for many of my future quick projects.
Oh, that's awesome! I think that's the fastest anyone has gotten it so far!
Someone else in the comments encouraged me to try a 1D vector of size (m+1)(n+1) and index into it with arithmetic -- that gave me a roughly 1.1-1.2ish x speedup over the original C++ . So, I guess much of the remaining speedup came from data locality-- very cool that it was another 0.3x-ish boost.
I'm glad you found the video interesting =]
Really interresting Video. I‘d love to learn more about it. Maybe I will be laughed at for this statement, but even with this video i feel like bringing python to C-Level performance seems to be quite a bit of an effort. Isnt it worth it to learn C/C++ for special tasks? How would you evaluate the developer‘s expirience comparing „Make everything possible with Python“ with „Learning C/C++ or Rust“?
Thanks a Lot!
You're right! It's not easy to get C++ performance in Python.
I think these tools are appropriate when there are a few "hot spots" in your code, but the majority of your application benefits from Python's ecosystem.
It's possible to directly build C extensions and call them from python, but I think these tools are way easier.
For some (new) projects, it might make sense to write the whole thing in Rust from the start.
In practice, most of my projects use a lot of Python libraries, and my team is not very flexible (they mostly only know Python), so it'd be pretty disruptive if I wrote a critical component in a different language and with different tooling.
Good question! (Sorry I don't have a good answer =P)
@@dougmercer that is indeed a good answer, thanks. Since I am working in the Data analysis field (geospatial) I love Python for its possibilities. I was wondering if it makes sense to learn another language for intensive calculations like C++. But think I will try your tools 😊 Many thanks!
for anyone interested, I copyed his c++ code and his example with 30000 elements in each vector, and in my computer it ran in ~25 seconds (my PC is slow). By simply compiling with -Ofast, it got down to ~5 seconds, still without modifying the code at all. I'm not hating in the content of the video, wich in fact is great
I compiled the C++ with -O3 for this video gist.github.com/dougmercer/1a0fab15abf45d836c2290b98e6c1cd3
@@dougmercer yeah I tought of that, but as (I think, it's been a while since I watched your video, and I really can't check now) you didn't mention anything, I assumed not. Well this makes it more impressive then for python, nice!
You're right, I didn't mention anything. Next time I'll be sure to show these sort of details in video.
Was your 25s measurement with no optimization?
I wonder what the difference between -O3 and -OFast are for a problem like this. (I'm admittedly not a C++ guy, so OFast is new to me!)
@@dougmercer yeah my 25s was with no optimization, and I guess -O3 and -Ofast shouldn't really make the difference here. As far as I know, -Ofast's main difference from -O3 is that it can change the order of float operations, wich is kinda illegal since a + (b + c) != (a + b) + c for floats, but this code in particular does not have any float calculations so it should be basically the same speed
the speed up at 2:26 is a funnier number than 100x but also much lower 2:56 minutes -> 176 seconds / 2.56 seconds
Ah, the clock visualization is confusing. The vanilla Python approach did take 256 seconds, not 2 minutes 56 seconds.
How did you make those amazing looking bar charts?
Hah, *very carefully* in Davinci Resolve (Fusion Page) =P
I manually drew the graph using rectangles, then applied (noise + displace) to make it more irregular + (fade it out with noise + the "painterly" effect from Krokodove) to give it the water color appearance + paper texture + adding lens blur
One of my favorite animations I've made =]. Thanks for commenting on it
Came for the video, stayed for the stock footage inserts x)
=]
I also used Nosferatu in my other video called "Your code is almost entirely untested"... I wonder what it means that I keep putting horror movie clips into my Python explainers 🤔
Considering the effort put to make the cython code run, i'd much rather just write C or C++ from the start
Hey, thanks for an amazing video!
Which one would you suggest so that I can just grab my regular python code with dataclasses and get a performance boost with no tweaks whatsoever?
Thanks for watching! =]
I'd try mypyc first. The others are way more disruptive and would probably require changes to your code
@@dougmercer Thanks for your answer!
I was thinking of something.
Nowadays we almost always use type hints because they are great.
But only for clarity/type-checkers like mypy.
So we are not getting any performance benefit out of it, although I think we could have!
Cython translates python to C and forces us to write statically-typed python for that. Which type hints could also be used for...
Turns out that Cython supports type hints as well!
Then we have stuff like MonkeyType that allows us to automatically type-hint code based on runtime behavior. Nice for annotating legacy code.
1) we write python code with type hints
2) if needed apply MonkeyType to apply them everywhere
3) compile with Cython
4) get a C-like performance
I wonder why it's not actually practiced.
Do you have any idea?
Mmm, for using type hints to achieve better performance through compilation, I think there's a high level design question: "should your code (1) look/feel like vanilla Python, or (2) are you OK with using non-standard Python features, or (3) are you willing to use syntax that only works in your special language, as long as it still vaguely resembles Python and interoperates with it"?
I think mypyc is the closest to achieving the goal of speeding up vanilla Python.
cython's python mode is pretty OK, but you need to add extra metadata to make it be performant (e.g., the locals decorator). Cython also has its own type system rather than using Pythons built-in types (e.g., cython.int vs int). Cython as a language (in non-python mode) isn't really Python any more, but interpolates with it well.
Some other languages (e.g., Mojo) claim to have a "python-like" syntax and support interacting with Python, but the code isn't really Python.
@@dougmercer Yeah it would be amazing if we could just write vanilla python with standard type hints and compile it with Cython.
Apparenly Cython somewhat supports it. TH-cam blocks my commend if I paste a link but you can search this on google:
Can Cython use Python type hints?
Because todays type hints are everywhere and we don't get any performance benefit out of it at all, which feels weird.
It's hard to say-- when I was experimenting with this problem I remember not observing any speed up when adding vanilla Python typehints, and it wasn't until I started adding things like the @locals decorator that I really noticed any improvement.
Let me know if you do any testing that shows a meaningful speed up!
How do you optimise python performance without any external libraries or programs? Just native python3 with the standard pre-installed libraries.
Hmm, I guess the only way would be to write efficient code. I'd profile the code to see what functions are taking the most time, and then focus on improving the slow/frequently called ones
Use the right data structures/algorithms. consider using functools.cache to memoize anything that would benefit from caching. reprofile your code after each change to quantify what changes were helpful.
You can technically write your own c extensions if your system has a c compiler, but that's probably not what you want.
fantastic thanks a lot
Glad it was helpful =]
Glad you are back!
And then there's Mojo, the one that will swallow Python in a serpently fashion. It's basically Python++, the Python superset.
Thanks se se! =]
I think mojo is very cool. That said, from what I know, I believe their license was restrictive for commercial use? Maybe I'll eventually do a follow-up video on it and the other proprietary Python superset that Im failing to recall the name of if this video does well.
I also skipped over PyPy, for the sole reason that it failed to install/run on my laptop. ¯\_(ツ)_/¯
if python++ was that good, cython would already be the big thing. I feel like this approach suffers from both worlds: it's harder to understand how a program works compare to python, so nobody uses it instead of python, and it's harder to optimize than c++, so nobody uses it instead of c++.
I'm missing nuitka on that comparison, but very cool.
I've never tried it! Does it work well? I'll have to mess with it sometime 🤔
That said, I am working on a video where I cover one library that I wanted to include in this video (PyPy).
What about optimizing your C++ implementation instead to go faster ?
Due to the heavy "readability" over efficiency mentality in the online Python community, which I personally just ignore, many examples of Python code use excessive memory and have redundant and/or inefficient logic. An infamous example is ArcGIS Pro using Python heavily but seems to be written by someone who knew C/C++ and tried to use conventions from C/C++ in Python resulting in hundreds of redundant functions that could be reduced to just a handful. I've also seen this following coding template before too many times:
var1 = "something"
var2 ="var1
Why?! I wish I was joking. Can someone explain to me why on Earth in any language and/or circumstance you'd want to do this?
Great video ! But why you did not put "ti.init(arch=ti.cuda)" instead of "ti.init(arch=ti.cpu) ????? This way Taichi will use the GPU instead of the CPU and the result is even faster. For instance a for in range(0, 1000000000) loop gives .05123 second instead of 2.4 sec. with C++ on a Nvidia RTX 4060. With "ti.init(arch=ti.cpu) you get "only" 0.081 sec.
So 1.59 times longer than on my GPU.
For this problem, the algorithm isn't parallelizable, so using GPU would slow it down. For the fractal visualization I did early in the taichu section, the GPU makes a world of difference!
Was your taichi (arch) based on cpu or gpu when you carried out the benchmark testing ?
The LCS dynamic program was on CPU. The visualization I showed at the beginning of the section of a kind of warping fractal was on GPU.
@@dougmercer Thanks. Taichi certainly looks promising, but I still prefer Numba for its simplicity, i.e. adding a couple of decorators, without altering the code too much. Have you tried Spark and Dask ? They're both parallel programming libraries.
Yup, both are great! Since this problem couldn't be easily parallelized, I didn't mention them.
And I agree, in general Numba will be easier than Taichi by a long shot. I just thought Taichi was kind of neat so I included it in the video ¯\_(ツ)_/¯
Did you compile with -O3 in c++?
Yup! gist.github.com/dougmercer/1a0fab15abf45d836c2290b98e6c1cd3
Some people in comments have gotten between 1.1-1.7x speed up through other improvements, but it doesn't really change the narrative much: these compiled Python tools frequently give good enough performance
@@dougmercer thank you! I think is really problem dependent. In some codebases I worked I had for example a 40x speed up over cython or numba by embedding very very small pure C functions using ctypes.
Oh definitely agree. Squeezing out performance is always "it depends" and "did you profile it?"
Yes, in my case there were two issues, the first was that cython for some things relies on the python interpreter if data and objects are not managed in the most cythonic way, the second was cache misses. I was working on a kd-tree implementation and a tiny detail on how nodes are managed let me cut out on cache misses during tree traversal. For that purpose I used perf to sample from the process but I know for sure that there are many other options for doing that.
@@dougmercer Moreover, numba is a life saver if you need performance on the fly without many refactors.
Numba doesn't really work outside of toy examples
You can't typically just slap it on your main function, but I've had a lot of success applying it to hot parts of the code (with a bit of refactoring)
Hey man, very interesting content. Some unsolicited advice that is meant to help and not be mean, but in my opinion all the stock video you use to try and describe every single sentence is a bit distracting and doesn’t add value and background music is a bit loud/unnecessary. Very interesting content though 👍
Hey, thanks for the really polite and sincere feedback!
I agree with both points. In more recent videos, I've used less and less stock footage, and I think I've gotten a bit better at mastering my audio to keep my voice easier to hear.
Hopefully I keep getting better at this moving forward.
Cheers!
@@dougmercerI should have checked out your latest stuff before commenting, I will check it out now, and subscribe 👍
But did you put as much effort in your CPP implementation as your python implementation? I love python as much as the next guy and I know a lot of python peeps don't want to write CPP but, at some point you gotta really wonder, "should I just learn CPP?"
In some of the other comments, people were able to squeeze another 10-20% performance out . It doesn't meaningfully change the msg of the video.
what about pypy?
I'm working on a video that uses it right now =]
You should not use vector of vectors in c++
First of all you will allocate memory m+1 times (for each of inner vector). This is slow.
Also this data layout is not cache friendly because each vector will be allocated on its own and whole table is scattered around.
What you really should do is define one big (m+1)*(n+1) vector and use this contiguous space as if it has two dimensions like this v[i*m + j]
So you skip i rows then select j column.
I bet you can easily beat python with this simple modification. Also be sure to compile it with at least -O2 optimization in release configuration so no debug stuff will slow you down at runtime
Another commenter actually already tried a single contiguous vector. They found that -O3 optimizes away any difference in performance.
Here's the comment thread where they talked about their attempts th-cam.com/video/umLZphwA-dw/w-d-xo.html&lc=UgyNE2s94tUKjG3hayF4AaABAg.9rJ4vi7-9UyA-ES8Dn0d1t (needs to be opened on desktop)
Here's a gist to the implementation and compile command used in the video gist.github.com/dougmercer/1a0fab15abf45d836c2290b98e6c1cd3
So, feel free to let me know if you get a significantly faster -O3 optimized version. If you do, I'll pin your comment.
If you will experiment try to change i and j in v[i*m + j]
By changing it you will change memory traversal order from row major to column major. This will change cache misses ratio and resulting speed.
You can google for cache friendly data layout to learn more. Those things are very important if you want speed!
Building libraries
Length 30000 Sequences, 10 Reps
Time using lcs_taichi_ndarray: 23.286619998994865s
Time using lcs_taichi_field_once (this one is cheating): 18.95515070901456s
Time using original C++: 26.1285s (n_rep=10)
Time using O(n) memory C++: 26.7876s (n_rep=10)
Time using flattened 2D into 1d C++: 22.3163s (n_rep=10)
so, a 1.10-1.20ish speed up, but not enough to meaningfully change the analysis.
@@dougmercer totally might be true because compilers are very smart with optimizations now days. Table is a local variable so compiler is allowed to do basically anything( as long ad observed behavior is not changed).
Also difference might be invisible if whole table can fit in cache. I dont remember size you tested.
Anyway it is good to know that you already familiar with this little details.
If everything is done properly it is really interesting why c++ looses some speed. I should look at your video more carefully…
I had two input arrays of length 30,000. so that would induce a 30,000 by 30,000 matrix. So, kind of big?
That said, the 1D indexing did close the gap between the taichi ndarray approach and the C++. So, I don't think it lost any significant speed to taichi.
Reason being, the the taichi approach that allocates the field once is unfair (insofar as other approaches could have also made that optimization, but I didn't implement them).
Love you channel! Nice 80ies sound!
Thanks! I had fun choosing music for this one =]