The 1+1/2 language problem is inevitable.
(Surprisingly?) Matlab to Julia happens to be the most difficult translation task, because the two languages look similar but are rather different. Coming from C++/Rust or even Python is easier. In addition, I think the speaker (and his team) should have relied more on the super helpful Julia community (e.g. Discourse) to help them translate their Matlab code. I think the "fast" final Julia version that is presented is still pretty far from idiomatic fast Julia code. That being said, the author is right when he emphasizes that Julia is a very powerful language that requires some serious training to get good performance. That is no surprise to HPC C++ devs, and Matlab users tend to forget how long it took them to rephrase all their algorithms in matrix form (get rid of the loops, treat a vector as a special kind of matrix, avoid N-D arrays when N is not equal to 2, etc.).
MATLAB defaults to MKL which out of the box has much faster linear algebra than OpenBLAS on many operations for many CPUs. Julia and most open source languages (R, Python, etc.) default to OpenBLAS because shipping with MKL can cause licensing issues. This means that if someone times a code that's just a matrix multiplication or \ operation right out of the box, for a sufficiently large matrix (100x100 or so), then they will see MATLAB as faster simply because it's using MKL as the BLAS instead of OpenBLAS. From what I can see from the Discourse threads people have posted for "why is this MATLAB code faster?", this tends to be the concrete reason for the difference.
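To check which backend a given Julia session is actually timing, something like the following can be used. It is only a minimal sketch: the 200x200 size is an arbitrary choice, and the commented-out `using MKL` line assumes the MKL.jl package is installed on a platform that MKL supports.

```julia
using LinearAlgebra

# List the BLAS/LAPACK libraries the session currently forwards to.
# A stock Julia install normally reports OpenBLAS here.
println(BLAS.get_config())

# Assumption: MKL.jl is installed. Uncommenting this line switches the
# session to MKL before the timing below.
# using MKL

A = rand(200, 200)
B = rand(200, 200)

A * B                          # warm-up call so compilation is not timed
elapsed = @elapsed C = A * B   # time one multiplication on the active backend
println("200x200 matmul: ", elapsed, " s")
```

Running it once with and once without the `using MKL` line gives a rough feel for how much of a benchmark gap comes from the BLAS backend alone.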
I don't think we should ignore this at all. With SciML we actually default to using MKL and AppleAccelerate (Mac M-series chips) internally, i.e. we always ship with an MKL_jll binary (if one exists on your platform) and have CPU checks that choose the default so that LinearSolve is using MKL vs OpenBLAS vs AppleAccelerate to be the fastest one based on benchmarks we have done on each platform. This can make about a 5x-10x difference for many people, and it's one of the reasons why SciML code can be much faster than something simple that is just handwritten in the language. This really demonstrates that for real usage it's not a small factor.
Julia has MKL.jl: if you just do `using MKL`, it overrides the BLAS shipped with Julia so that MKL is used instead. Note this is not what SciML is doing; it instead directly uses the MKL_jll binary so as not to flip the libblastrampoline. However... I really want to be doing this to most people's computers. The majority of people would get a speed improvement from this, even with AMD chips sans Epyc. Epyc seems to be the only platform on which MKL is slower than OpenBLAS, and on most platforms it's about 10x faster (on Epyc it's about 2x slower). So if I could pull out a hammer and instantly apply an effect, it would be to just (1) default to MKL, (2) change the default to AppleAccelerate if M-series is detected, (3) change the default to OpenBLAS if Epyc is detected, and 99% of people would see linear algebra go a lot faster. MATLAB does (1) but it doesn't even do steps (2) or (3), so we can beat it there pretty easily. All of the tools to do this exist in the language; they just need to be set as defaults rather than assuming the user knows to do this.
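The three rules above can be written down as a small decision function. This is a hypothetical sketch: `preferred_blas` is an illustrative name, not part of Julia, MKL.jl, or LinearSolve.jl, and matching on the CPU model string is an assumed detection method, not how SciML performs its checks.

```julia
# Hypothetical helper, not part of Julia, MKL.jl, or LinearSolve.jl.
# It encodes rules (1)-(3) from the comment as a pure function.
function preferred_blas(cpu_model::AbstractString; apple_silicon::Bool = false)
    apple_silicon && return :AppleAccelerate            # rule (2)
    occursin(r"epyc"i, cpu_model) && return :OpenBLAS   # rule (3)
    return :MKL                                         # rule (1)
end

# Inspect the current machine. Sys.cpu_info() reports one entry per core;
# fall back to an empty string if the platform reports nothing.
infos = Sys.cpu_info()
model = isempty(infos) ? "" : infos[1].model
on_apple_silicon = Sys.isapple() && Sys.ARCH === :aarch64
println(model, " => ", preferred_blas(model; apple_silicon = on_apple_silicon))
```

A real default would also have to fall back to OpenBLAS wherever no MKL binary exists for the platform, which this sketch leaves out.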
I'm very curious about whether the preallocation trick is also needed in C++. I don't have any high-performance computing experience, but my narrow understanding is that one needs to manually take care of memory allocation for critical functions in a loop by lifting all temporary variables outside of the loop. If that is the case, Julia's preallocation problem is not that bad, since it is a limit of today's compiler technology rather than of Julia.
I also wonder why the translated Julia code is that slow compared to the MATLAB one, since MATLAB also uses garbage collection (GC). Julia and MATLAB should waste similar time on it. Preallocation in Julia can make the code fast, but not doing it should not make it much slower than MATLAB (only due to GC).
I must say preallocating EVERY temporary variable is really tedious, and changing from the simple b=A*x to mul!(b,A,x) makes the code less clean. I think progress should be made here. I don't know whether there is already some package that solves this kind of problem in general. I hope more syntactic sugar is offered here, e.g. some clever macro that makes the assignment b=A*x write into b in place.
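For readers who have not seen the two forms side by side, here is a minimal sketch of what the `b=A*x` versus `mul!(b,A,x)` trade-off looks like in a hot loop. The function names and the 100x100 size are made up for illustration; the exact byte counts will vary by Julia version.

```julia
using LinearAlgebra

function run_allocating(A, x, n)
    acc = 0.0
    for _ in 1:n
        b = A * x          # allocates a fresh result vector every iteration
        acc += b[1]
    end
    return acc
end

function run_inplace!(b, A, x, n)
    acc = 0.0
    for _ in 1:n
        mul!(b, A, x)      # overwrites the preallocated b, no new array
        acc += b[1]
    end
    return acc
end

A = rand(100, 100)
x = rand(100)
b = similar(x)             # allocated once, reused by every iteration

run_allocating(A, x, 1); run_inplace!(b, A, x, 1)   # warm up (compile)
bytes_allocating = @allocated run_allocating(A, x, 1_000)
bytes_inplace = @allocated run_inplace!(b, A, x, 1_000)
println("A * x: ", bytes_allocating, " bytes, mul!: ", bytes_inplace, " bytes")
```

Both loops compute the same values; only the number of temporary arrays handed to the garbage collector differs.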
In C/C++, preallocation is often lifted not just outside of loops, but also outside of functions. Basically, you give each function an array of bytes to use as scratch space or as a place to output results. That way you can reuse memory allocations between multiple functions. So yes, practically every function becomes an in-place one, and the same 4-5 allocations can be rotated to store intermediate results throughout a large pipeline.
Newer languages (Zig, Odin) try to refine this concept into context allocators (basically the same thing, but the pointer to scratch memory and result memory is passed to operations implicitly, which maintains the "nice" syntax).
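The same scratch-space pattern can be sketched in Julia, the language this thread is about. Everything here is hypothetical naming (`make_workspace`, `stage1!`, `stage2!`, `pipeline!`); the point is only that the buffers are created once by the caller and every stage writes into them.

```julia
using LinearAlgebra

# Hypothetical scratch space: two buffers allocated once and shared by all stages.
make_workspace(n) = (tmp1 = zeros(n), tmp2 = zeros(n))

function stage1!(out, A, x)
    mul!(out, A, x)            # out = A * x, written in place
    return out
end

function stage2!(out, v)
    out .= v .^ 2 .+ 1.0       # fused broadcast, written in place
    return out
end

function pipeline!(ws, A, x)
    stage1!(ws.tmp1, A, x)
    stage2!(ws.tmp2, ws.tmp1)
    return sum(ws.tmp2)
end

A = rand(50, 50)
x = rand(50)
ws = make_workspace(50)
result = pipeline!(ws, A, x)   # repeated calls reuse the same two buffers
```

Passing `ws` explicitly is the manual version of what a context allocator would thread through implicitly.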
This talk is motivated by a case study of porting a MATLAB project to Julia, so as a counterpoint I would suggest watching Jonathan Doucette's talk "Matlab to Julia: Hours to Minutes for MRI Image Analysis" from JuliaCon 2021. As its name suggests, it covers porting MRI image analysis code from MATLAB to Julia with a 60X speed gain. Literally, hours were reduced to minutes!
th-cam.com/video/6OxsK2R5VkA/w-d-xo.html
Interactivity and performance do not go together as well in Julia as advertised. As a simple example, it is advised to use constants instead of global variables for performance, but redefining them in a notebook environment leads to errors/warnings. Similarly, structs cannot be redefined without restarting the kernel.
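One workaround that may soften the constants issue is a typed global, which keeps a fixed type for the compiler while still allowing reassignment from a notebook cell. This is a sketch assuming Julia 1.8 or newer; the names `gain` and `apply_gain` are invented for the example, and it does nothing for the struct redefinition problem.

```julia
# Typed global (Julia 1.8+): the type is fixed, so functions reading it can be
# compiled against Float64, but the value can still be reassigned freely.
gain::Float64 = 2.0

apply_gain(x) = gain * x

first_result = apply_gain(3.0)
gain = 10.0                      # reassignment, no constant-redefinition warning
second_result = apply_gain(3.0)
println(first_result, " then ", second_result)
```

Passing such values as function arguments remains the more robust option when it is practical.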
Imo deep learning libraries prove that memory management should be handled by the compiler. Jax is a compiler written in Python. If they made a language instead, to remove its downsides (bad error traces, ugly control flow syntax), then it would be perfect.
I wouldn't go that far. Deep learning libraries have proved the limitations of memory management by compilers. For a long time people touted GHC (Haskell) as a clear sign that functional programming languages would rule the world because they can prove certain things about memory to auto-optimize some things that can otherwise be difficult. Jax, as a functional programming language, is simply a third iteration of that now targeted as a DSL to machine learning engineers.
Now, there's a reason why you never see a BLAS/LAPACK written in Haskell, and that's because people were always able to find ways to greatly outperform GHC. In practice, "a sufficiently smart compiler" is never smart enough. And that's what you see with Jax as well. It's not hard to write code that's about 10x faster than Jax; see for example SimpleChains.jl hitting about 15x on small neural networks and DiffEqGPU.jl hitting about 20x-100x faster GPU speeds due to using kernel generation rather than array primitives. Another example of this is the llama2c greatly outperforming the Jax translation. So clearly Jax isn't fast, because you can pretty easily 10x it if you know what you're doing. Having the ability to do things manually is thus still essential to a fast programming language which is targeting general purpose use.
What Julia should learn from Jax is that for the majority of individuals, a simple memory management scheme by a smart enough compiler can give good enough results. What Julia is missing is proper escape analysis so that simple mathematical calculations will use a stack rather than a heap and smartly reuse memory. Jax has done this well and Julia has not, and that's probably the main thing that most new users run into. I think improving that experience while also allowing all of the modifications necessary for doing things manually is thus what it needs to evolve into as a general purpose language. I'm actively talking with the compiler team to use some of the SciML manual improvements as examples and test cases for such improvements to the compiler, and there's currently some work going towards such escape analysis features.
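To make the stack-versus-heap point concrete, here is a toy sketch of the kind of temporary that escape analysis would ideally handle automatically. The function names are invented for the example; today the user typically has to pick the tuple form by hand, and measured byte counts depend on the Julia version.

```julia
# Temporary stored in a Vector: typically a heap allocation on every
# iteration, unless the compiler can prove the array does not escape.
function sum_norms_vector(n)
    total = 0.0
    for i in 1:n
        v = [Float64(i), 2.0 * i]
        total += sqrt(v[1]^2 + v[2]^2)
    end
    return total
end

# Same math with a tuple: fixed size and immutable, so no heap allocation
# is needed for the temporary.
function sum_norms_tuple(n)
    total = 0.0
    for i in 1:n
        v = (Float64(i), 2.0 * i)
        total += sqrt(v[1]^2 + v[2]^2)
    end
    return total
end

sum_norms_vector(10); sum_norms_tuple(10)   # warm up (compile)
bytes_vector = @allocated sum_norms_vector(10_000)
bytes_tuple = @allocated sum_norms_tuple(10_000)
println("Vector temporaries: ", bytes_vector, " bytes; tuple temporaries: ", bytes_tuple, " bytes")
```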
@@chrisrackauckasofficial Your comment about Julia and Jax is extremely biased and cannot be taken seriously. For one, the real market place doesn't care about the 10x or 100x speedup on small neural networks. Second, it's not a big deal to find out that your hand-written kernels are faster than array-based algorithms; to make a valid point you should really compare kernels written in Julia vs actual kernels written in Python. Finally, you seem to be unaware of the fact that torch.compile can compile a piece of numpy code into cpp code for cpu AND cuda code runs on GPU. I'd love to see a comparison with that instead. But I already know that python would win since it's a meta language that can seamlessly generate higher performance code.
@@nickjordan6360 Let's take this one at a time.
(1) "For one, the real market place doesn't care about the 10x or 100x speedup on small neural networks.". That's not the case in all markets. There's a growing market using neural networks on embedded devices as surrogates for things like model-predictive control. In fact, the speaker in this very talk is someone in this market, as the kind of microcontrollers found in washing machines tend to have on the order of MBs of RAM. These are the kinds of applications that many industries are looking to target with some form of learned surrogates.
(2) "Finally, you seem to be unaware of the fact that torch.compile can compile a piece of numpy code into cpp code for cpu AND cuda code runs on GPU. I'd love to see a comparison with that instead.". That is the comparison. The new default JIT in PyTorch is NNC, which is a tensor expression fuser, a design first done in Halide and very similar in nature to Jax's JIT. The comparison was done with these. The point though is that the machine learning accelerators have a JIT that is heavily optimized towards the assumption of specific kinds of tensor operations, and things that are not deep learning (like solving ODEs) do not necessarily have the same structure. You can see these timings in detail in all combinations of vmap and JIT with Jax here: colab.research.google.com/drive/1d7G-O5JX31lHbg7jTzzozbo5-Gp7DBEv?usp=sharing. There's a small overhead to calling the Julia functions since the benchmarking in that link is done on Colab from Python, but it still demonstrates a 10x against JIT'd Jax functions. You can see that the JIT actually doesn't make a noticeable difference though; a quick profile would show you that the dominant cost is between non-fusable operators. The peer reviewed article (www.sciencedirect.com/science/article/abs/pii/S0045782523007156, or open access version arxiv.org/abs/2304.06835) goes into detail describing how this is a direct consequence of the way that the JIT compilation is occurring, showing that you get similar performance in Julia too if you do the parallelization and compilation in the same way. And it shows that CUDA kernels written directly in CUDA C++ match the performance of Julia.
What this shows, then, is that it's not a language thing at all; it's how you do the JIT compilation and the parallelization. The domain-specific accelerators of PyTorch and Jax do this in a very specific way that tends to be good for deep learning problems, but this is a demonstration that they are not general-purpose accelerators, in the sense that they make deep underlying choices in the architecture of how the code is compiled, and the "how" can be orders of magnitude off from something that is optimized. The general remark here is "of course"; in fact some reviewers said it's obvious that the detailed architecture would outperform by an order of magnitude or two, so I'm sure that it's not too surprising of a result, but it highlights scientifically useful cases where directly writing code can outperform accelerators.
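A CPU-only toy sketch of the distinction between "array primitives" and "kernel generation" for an ensemble of small ODE solves may help here. It is not the DiffEqGPU.jl implementation or the benchmark from the linked notebook; the problem (explicit Euler on du/dt = -p*u) and all function names are assumptions chosen for illustration.

```julia
# Toy problem: du/dt = -p * u, explicit Euler, many independent parameters p.

# "Array primitive" style: one broadcast over the whole batch per time step,
# so every step is a separate sweep over the arrays.
function solve_batched(u0::Vector{Float64}, p::Vector{Float64}, dt, nsteps)
    u = copy(u0)
    for _ in 1:nsteps
        u .= u .- dt .* p .* u
    end
    return u
end

# "Kernel" style: each trajectory runs its entire time loop on scalars.
function solve_one(u0::Float64, p::Float64, dt, nsteps)
    u = u0
    for _ in 1:nsteps
        u -= dt * p * u
    end
    return u
end

# One independent "kernel" invocation per trajectory.
solve_kernel(u0, p, dt, nsteps) = map((u, q) -> solve_one(u, q, dt, nsteps), u0, p)

u0 = ones(1_000)
p = collect(range(0.1, 2.0; length = 1_000))
batched = solve_batched(u0, p, 0.01, 100)
kernel = solve_kernel(u0, p, 0.01, 100)
println("max difference: ", maximum(abs.(batched .- kernel)))
```

Both versions compute the same numbers; what changes is how the work is laid out, which is the architectural choice the comment is pointing at.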
(3) "I already know that python would win since it's a meta language that can seamlessly generate higher performance code." I don't quite understand what you mean by "win" since there's no competition. There are lots of interesting choices being explored, each having advantages and disadvantages. No engineering choice can be made without some kind of trade-off! I think the important thing to understand with each tool is the trade-offs being made and the reasons behind these trade-offs. I myself regularly contribute to open source libraries in Julia, Python, and R (writing a bit in C and these days trying some Rust on the side) in order to better understand these trade-offs. I think the answer here to the speaker's question is something that is nuanced in dealing with such trade-offs. It's good to know what performance is being missed by high level accelerators, but also some of the usability gains. Jax in particular does some interesting things with memory that would be good to incorporate into Julia to improve memory handling, though some of the other choices (like vmap and its compilation as highlighted here) are more domain-specific, and so the level to which some of the optimizations should be done when the compiler is used in contexts outside of ML is a fairly nuanced topic.
I think that what's really required is a set of optimizations to eliminate memory allocations in contexts where it's easy to prove no escape occurs (similar to Jax and PyTorch JIT), but without the trade-off that all memory has to be handled through this system. I think it would look similar to C++'s RAII in its lowered form, though feel more like a Jax JIT thing to the high level language user. This would allow for fully preallocated handling by advanced users but get the "standard" user up to Jax/PyTorch JIT speed. It would increase the complexity of the compiler a bit, but I think that would be a good trade-off.
@@chrisrackauckasofficial YouTube is deleting my comments so I can't make a response.
@@nickjordan6360 YouTube has an auto-deletion bot mechanism: support.google.com/youtube/answer/13209064. One of the things that can trigger it is too many links that 403 redirect, but also negative or hateful phrases.
Huh? My cellular network simulation ran about 100x faster in Julia after first converting it from Matlab. I've never had a program run slower in Julia compared to Matlab or Python. This case seems unusual to me. Maybe I just program differently. I do pay attention to optimize inner loops and use @view where appropriate.
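As an illustration of the `@view` habit mentioned above, here is a small sketch comparing a slice (which copies) with a view (which does not) inside an inner loop. The function names and the 500x500 size are made up for the example.

```julia
function colsum_copy(A)
    total = 0.0
    for j in axes(A, 2)
        total += sum(A[:, j])         # A[:, j] copies the column first
    end
    return total
end

function colsum_view(A)
    total = 0.0
    for j in axes(A, 2)
        total += sum(@view A[:, j])   # no copy, reads A's memory directly
    end
    return total
end

A = rand(500, 500)
colsum_copy(A); colsum_view(A)        # warm up (compile)
bytes_copy = @allocated colsum_copy(A)
bytes_view = @allocated colsum_view(A)
println("slices: ", bytes_copy, " bytes, views: ", bytes_view, " bytes")
```

MATLAB-style slicing translated line by line into Julia ends up in the first form, which is one common source of unexpected allocations in ported code.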
I have never experienced Julia being slower in this way when compared to MATLAB in any of the very computationally heavy scripts I've written. To address the main point: yes, really bad Julia code is going to be worse than really good MATLAB code. I agree that Julia should make it easier to write good code (that is true for every language, and every language should be faster), but this just doesn't seem like a fair comparison. If I were to steelman this guy's point, though, it seems like Julia could make a lot of headway in terms of ease of implementing performant code by taking a second look at, or overhauling, its approach to temporary variables and to when and in what scopes allocations happen.
One can more easily write efficient code in MATLAB than in Julia.