43:51 Copying with memcpy is also simply incorrect if your key/value types aren't trivially copyable. For trivial types, replacing the copy-constructor body with "= default" should generate the same code.
You are right about the triviality requirement (I didn't mention it, but I should have). Regarding the quick copy: I am talking about copying the entire map, which for this data structure can be done with a single memcpy call (again, for trivial types). That is not the same as calling the copy constructor for each object (even for trivial objects).
@@MultiNimrods Assuming the types are trivial, doesn't the compiler already turn the individual copies into a memcpy in this case?
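To make the exchange above concrete, here is a minimal sketch of a map whose whole table can be copied with one memcpy. It is not the data structure from the talk: `FlatMap`, `Slot` and `clone_from` are hypothetical names, and the sketch assumes integer keys and linear probing. The `static_assert` enforces the triviality requirement raised in the first comment.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical fixed-capacity map: every slot lives inline in one array,
// so the whole table is a single contiguous block of bytes.
// Assumes K is an integer type (the key is used directly as the hash).
template <typename K, typename V, std::size_t N>
struct FlatMap {
    struct Slot {
        K key;
        V value;
        bool used;
    };

    // memcpy is only valid for trivially copyable types; refuse to compile otherwise.
    static_assert(std::is_trivially_copyable<K>::value &&
                      std::is_trivially_copyable<V>::value,
                  "FlatMap requires trivially copyable key and value types");

    std::array<Slot, N> slots{};  // value-initialised: all slots start unused

    // Linear probing insert; returns false when the table is full.
    bool insert(K key, V value) {
        const std::size_t start = static_cast<std::size_t>(key) % N;
        for (std::size_t i = 0; i < N; ++i) {
            Slot& s = slots[(start + i) % N];
            if (!s.used || s.key == key) {
                s = Slot{key, value, true};
                return true;
            }
        }
        return false;
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const V* find(K key) const {
        const std::size_t start = static_cast<std::size_t>(key) % N;
        for (std::size_t i = 0; i < N; ++i) {
            const Slot& s = slots[(start + i) % N];
            if (!s.used) return nullptr;
            if (s.key == key) return &s.value;
        }
        return nullptr;
    }

    // Copy the entire table with one call instead of one copy per element.
    void clone_from(const FlatMap& other) {
        std::memcpy(slots.data(), other.slots.data(), sizeof(Slot) * N);
    }
};
```

On the last question: for a layout like this, compilers will often lower a defaulted copy to the same block copy, so the difference may disappear in practice. That depends on the compiler and flags, so it is worth checking the disassembly rather than assuming either way.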
@45:20 Of course -march or -mtune will degrade latency performance, they are designed to maximise throughput. SIMD instructions will of course have higher latency than their scalar counterparts.
I’m new here. I’m currently studying computer architecture before building low-latency HFT software, because the hardware matters. I know C++ and am working heavily on that too. How would you guide a beginner who wants to contribute to low-latency coding?
I would learn as much as possible about advanced C++ features and paradigms (such as CRTP, which is very useful). Understand memory, caching (memory and CPU), branch prediction, efficient memory allocation and pipelining. All of those become extremely important when writing low-latency code.
@@nimrodsapir3256 Feels really good that those are exactly the things I’m focusing on. I learned basic C++ from “C++ Primer Plus” by Stephen Prata and have an ongoing course on computer architecture. For more advanced C++, I’ll turn to “C++” by Bjarne Stroustrup.
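For anyone who hasn't met CRTP (mentioned in the reply above), here is a minimal sketch of the pattern. The handler names are made up for illustration. The base class calls into the derived class through a `static_cast`, so the call is resolved at compile time and can be inlined, with no virtual dispatch involved.

```cpp
#include <cassert>
#include <cstdint>

// CRTP base: Derived passes itself as the template argument.
template <typename Derived>
struct OrderHandler {
    std::int64_t handle(std::int64_t price, std::int64_t qty) {
        // Static dispatch: the target is known at compile time.
        return static_cast<Derived*>(this)->on_order(price, qty);
    }
};

// Hypothetical handler: returns the notional value of the order.
struct NotionalHandler : OrderHandler<NotionalHandler> {
    std::int64_t on_order(std::int64_t price, std::int64_t qty) {
        return price * qty;
    }
};

// Hypothetical handler: keeps a running total of quantity seen.
struct CountingHandler : OrderHandler<CountingHandler> {
    std::int64_t seen = 0;
    std::int64_t on_order(std::int64_t /*price*/, std::int64_t qty) {
        seen += qty;
        return seen;
    }
};

// Generic code written against the base template, still without virtual calls.
template <typename H>
std::int64_t dispatch(OrderHandler<H>& h, std::int64_t price, std::int64_t qty) {
    return h.handle(price, qty);
}
```

The trade-off is that the concrete type must be known at compile time, so this replaces virtual functions only where the set of implementations is fixed when you build.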
Wouldn't it be better to use a custom version of the compiler that generates code to warm up your cache, instead of constantly fighting with "leaky abstractions"?
Probably not. The extra cost of building your own compiler would exceed the benefit. It would require not merely compiler-oriented programmers, but also that those programmers become highly versed in the CPU-specific optimizations. These sorts of developers are harder to find. More code to maintain, and more developers means higher costs, for a hard-to-measure benefit.
Also, this is a good moment to say that I'd love to go further - have a compiler that simulated the internal state of the CPU (which isn't always very well documented) and optimized the uops and evened out the port-pressure (e.g. how to distribute the work among the ALUs in each core). It might be tricky to provide compile-time core-associativity to allow for such optimizations, but it could be done!
@@ShalomCraimer I think there is no need to build an entirely new compiler from scratch, but maybe just a few extensions for existing compilers.
@@LordNezghul I *was* only talking about the work of building a new backend for the compiler: the part that decides how to convert the IR into the binary (e.g. machine code for the specific x86 CPU you want to optimize for). It's still a non-minor undertaking, not just to do it, but to prove that there is an improvement from the optimizations. Even discovering the optimizations would become a full-time job, especially while trying to keep up with new Intel hardware.
Thanks for your comment, and I have to ask - this custom compiler you describe would have to detect (at compile time) the flows which are rarely executed but business critical (you don't want to warm up all your code, just those specific flows), which is not something that I think can be deduced automatically. Also, the generated code should run without side effects, which is a very tricky definition (some counters may be harmless if accessed by the warmup code, while others must be replaced with a mockup). Again, it is very likely I am missing something here...
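A rough sketch of what the manual version of this looks like, to show why it is hard to automate. All names here (`Gateway`, `send_order`, `warm_up`) are hypothetical and the "wire" is just a vector standing in for the network card. The programmer, not the compiler, picks the flow to warm and decides which state must stay untouched. The warm-up runs the same function as the live path and bails out at the last possible step, so the same instructions and data get touched.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Order {
    std::uint32_t id;
    std::int64_t price;
    std::int64_t qty;
};

// Stand-in for the real send path (assumption: a vector instead of a NIC queue).
struct Gateway {
    std::vector<Order> wire;        // what actually went out
    std::uint64_t orders_sent = 0;  // business counter: must not move during warm-up

    void transmit(const Order& o, bool live) {
        if (!live) return;  // warm-up stops here, after running everything before it
        wire.push_back(o);
        ++orders_sent;
    }
};

// The rarely executed but critical flow. Live and warm-up calls share this code.
inline bool send_order(Gateway& gw, std::uint32_t id, std::int64_t price,
                       std::int64_t qty, bool live) {
    if (price <= 0 || qty <= 0) return false;  // validation runs in both modes
    const Order o{id, price, qty};
    gw.transmit(o, live);
    return true;
}

// Called periodically from the idle loop to keep the path hot.
inline void warm_up(Gateway& gw, std::size_t iterations) {
    for (std::size_t i = 0; i < iterations; ++i) {
        send_order(gw, 0, 1, 1, /*live=*/false);
    }
}
```

Even in this tiny case, someone had to decide that `orders_sent` is not safe to touch during warm-up. That judgement is the part a compiler would struggle to make on its own.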
The 1st video I watched at 2x speed, and I also skipped the initial 20 minutes used as cache warming... Should have been applying HFT algos ;)
There are a lot of invalid suggestions and recommendations in this video.
Hi, thanks, it's a good video. I have a question: why use C++ instead of your own OS, drivers and assembly language? Why use the Linux kernel and C++?
Just a guess, but besides the fact that maintenance would be a nightmare, compilers are often better at optimization than people writing assembly themselves.
Just to comment - these days we have ways to run our logic end to end in userspace (we use specialized network cards and drivers). So as long as the kernel is configured to allocate the resources we need, we don't need to write any kernel code.
@@JMRC To add to this - compilers are better at optimising than humans, but you can always look at the disassembly produced from compiled C++ and try to add different optimisations that way, which is much easier than writing assembly from scratch.
Why not use an RTOS? Or even bare-metal, application-specific code?
They use multicore, high-end, state-of-the-art processors, not microcontrollers.
@@bibekkoirala8802 An RTOS runs just fine on a modern Intel space heater. I'm saying you'd have complete control over what is in the ISRs and be better suited to manage latency. Heck you could even write a thread that NEVER leaves context.
@@paulmccumber9291 space heater lmaooo. AFAIK they cut down all fluff from linux kernel, modify the networking layers(kernel bypass) and other performance modifications shit. So, they do get RTOS-like benefits from linux, maybe not hard real-time but close. IMO pure RTOS is better suited for something like sampling audio signals in real-time where you don't need networking protocols and shit like that. Just my views, I don't work in audio or HFT.
Indeed, some of the HFT firms do what you say: bare-metal code, but usually on an SoC (FPGA network stack plus an FPGA or ARM implementation of the algo, depending on the complexity). I think they are just keen on moving as much of the implementation/logic to hardware as possible.
Basically, the idea is to bypass the OS services on the real-time path altogether (ideally, all the memory is pre-allocated, the networking uses kernel bypass, and the real-time threads are pinned and spinning). So the OS scheduler only handles the administrative tasks of the system. Beyond that, an FPGA can indeed give even higher performance, but of course adds a lot of limitations.
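A small sketch of the "pre-allocated, pinned and spinning" part of the comment above. It is not taken from the talk, and `SpscRing`, `spin_pop` and `pin_current_thread` are hypothetical names. The ring's storage is allocated once, a consumer busy-waits on it instead of blocking in the kernel, and the pinning helper is a best-effort wrapper around Linux's `sched_setaffinity` that reports failure elsewhere. The ring assumes exactly one producer thread and one consumer thread.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#if defined(__linux__)
#include <sched.h>
#endif

// Best-effort pinning of the calling thread to one core (Linux only).
inline bool pin_current_thread(int cpu) {
#if defined(__linux__)
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;  // pid 0 = calling thread
#else
    (void)cpu;
    return false;  // not supported in this sketch
#endif
}

// Single-producer/single-consumer ring with storage allocated up front.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");
    std::array<T, N> buf_{};            // no heap allocation after construction
    std::atomic<std::size_t> head_{0};  // next slot to read (written by consumer)
    std::atomic<std::size_t> tail_{0};  // next slot to write (written by producer)

public:
    bool try_push(const T& v) {
        const std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N) return false;  // full
        buf_[t & (N - 1)] = v;
        tail_.store(t + 1, std::memory_order_release);  // publish the element
        return true;
    }

    bool try_pop(T& out) {
        const std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;  // empty
        out = buf_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);  // free the slot
        return true;
    }

    // Busy-wait until an element arrives; the thread never sleeps in the kernel.
    T spin_pop() {
        T out{};
        while (!try_pop(out)) {
            // spin
        }
        return out;
    }
};
```

In a real setup the consumer would call `pin_current_thread` once at startup and then loop on `spin_pop`. The spinning burns a whole core, which is the price paid for not waiting on the scheduler.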
Which programming languages are best to learn today for high frequency trading?
c++ for speed, python for ease, but to directly answer, it is c++
C++ but maybe start with python if ur new to programming imo
Rust is gaining big traction rn in more agile firms
@@draked8953 Could you name those agile firms? Really curious
@@draked8953 High Frequency Trading, Low Frequency Development
Why Cpp over C, if performance is of ultimate importance?
Also what about Rust vs Cpp?
How about C? I love C++ but I feel like C is right next to the hardware.
C doesn't have the STL and templates
...They both are? C++ is more or less an extension of C and keeps all the original C libraries.
@@ghdshds1899 Debug some C++ and watch it bounce up and down the vtables
@@ghdshds1899 If you ever debug C++ you spend a lot of time bouncing around vtables. I love C++ but for pure speed, I'd think C would be faster.
@@personaladdress3539 I'm not arguing from a language strength standpoint, but if pure speed is an issue, C++ spends lots of time in the vtables.
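One thing worth checking on the vtable point: in C++ a class only gets a vtable if it declares virtual functions. The sketch below uses made-up types to show the difference. The first `sizeof` check is not guaranteed by the standard, but it holds on mainstream compilers.

```cpp
#include <cassert>
#include <type_traits>

// No virtual functions: no vtable, no hidden pointer, calls are direct.
struct PlainQuote {
    int price;
    int bump(int d) { return price += d; }
};

// Virtual functions: the class becomes polymorphic, and calls made through a
// base pointer or reference go through the vtable.
struct VirtualQuote {
    int price;
    virtual int bump(int d) { return price += d; }
    virtual ~VirtualQuote() = default;
};

static_assert(!std::is_polymorphic<PlainQuote>::value,
              "PlainQuote has no virtual functions");
static_assert(std::is_polymorphic<VirtualQuote>::value,
              "VirtualQuote has virtual functions");
// Holds on mainstream ABIs: the plain struct is just its int member.
static_assert(sizeof(PlainQuote) == sizeof(int),
              "PlainQuote carries no hidden vtable pointer");
```

So C++ code that avoids `virtual` on the hot path (for example with templates or CRTP) pays no vtable cost, which is usually how low-latency C++ is written.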
I am surprised that trading volume is just 50% lmaooo
Thx
Interesting video, thanks for making it. One tip: stand still. You shift from side to side, and as you are on screen it's very distracting. :-)
Have you ever lectured before ?
@@comitcrafter yes I have quite a lot and I have been videoed doing it.
Thanks for the tip (really) - this was my first time doing such a long lecture so I was quite nervous...
If you don't speak English, then choose your own language, because it is very annoying listening to someone trying hard to find the proper words to express himself in a foreign language.
Stupid comment, nothing wrong with his English.
at least he is trying to help others
@@spicetard249 Fuck that shit, all we need to do in life is take care of ourselves. I say, if some asshole is wasting his time helping others, just take advantage of him.
@@totenkopf30 you must be fun at parties
@@peterhooper2643 what parties asshole, I fucking hate human beings