CppCon 2018: G. Nishanov “Nano-coroutines to the Rescue! (Using Coroutines TS, of Course)”

  • Published 25 Aug 2024

Comments • 55

  • @AK-vx4dy
    @AK-vx4dy 2 years ago +2

    Best short, clear introduction to coroutines. The first 4 minutes should be included in the coroutines Wikipedia article.

  • @JesseBusman1996
    @JesseBusman1996 4 years ago +7

    Cool :)
    A good takeaway is to treat random access into RAM as a blocking I/O call, just like hard-drive reads.

  • @broken_abi6973
    @broken_abi6973 5 years ago +16

    The other talk, from Kris Jusiak, shows that coroutines' performance is bad for state machines compared to the alternatives, while this talk shows the contrary. It would be interesting to understand why we get these diverging results.

    • @petermuller9518
      @petermuller9518 5 years ago +2

      This talk emphasized the ease of using coroutines to achieve cooperative multitasking (e.g. executing another state machine while one is blocked on a memory access), while the boost::sml talk did not use parallelism.
      Nevertheless, the boost::sml talk was more about comparing different approaches, and in my opinion the main takeaway was "you can have declarative state machines with acceptable performance".

    • @broken_abi6973
      @broken_abi6973 5 years ago +6

      Yes, but it seems there are some overheads associated with coroutines that we should be aware of. Just saying "you can have declarative state machines with acceptable performance" felt very unsatisfactory to me.

    • @GorNishanov
      @GorNishanov 5 years ago +25

      Coroutines are suitable for representing state machines that can be expressed as imperative control flow. They cannot replace all state machines. Sometimes it is simpler to write a state machine as a coroutine; sometimes it is the other way around. Which will give you better performance depends on whether the optimizations the compiler performs on the coroutine, before it is split up into a state machine, make the resulting state machine simpler than the one you could write by hand in reasonable time. Both approaches are valuable and, as usual, selecting the right tool for the job will give you the best result.

    • @travisdowns5469
      @travisdowns5469 5 years ago

      What is the "other talk" by Kris Jusiak?

    • @rqstaffan
      @rqstaffan 5 years ago +3

      @@travisdowns5469 th-cam.com/video/yZVby-PuXM0/w-d-xo.html

  • @damieng7636
    @damieng7636 1 year ago

    That's really cool! Thanks for the talk.

  • @elisa4776
    @elisa4776 3 months ago

    Four videos later, I finally (kinda) get what coroutines are...

  • @PrzemyslawSliwinski
    @PrzemyslawSliwinski 2 years ago +1

    26:13 - enlightening experience!

  • @the_faminist
    @the_faminist 1 year ago

    Captions: whatever wonderful (mumbles) you're trying to achieve
    Sounded like he said "whatever wonderful cache locality you're trying to achieve"

  • @PixelPulse168
    @PixelPulse168 5 years ago +4

    It's a new mindset for programming. We need more examples.

    • @baiwfg2
      @baiwfg2 2 years ago +1

      Yes, I think something like prefetch is quite an esoteric facility. Maybe we need more everyday examples.

  • @harryliang9263
    @harryliang9263 3 years ago

    Multithreading vs. multithreading + coroutines? Which wins on raw speed?

  • @movax20h
    @movax20h 4 years ago

    Is it possible to compile different parts of your code that use coroutines separately and then link them? How does that work? I guess the main issue is the implicit state and its size. It can't be encoded in the signature or mangled name; that simply isn't scalable in big projects. So that forces the state to not be part of the signature and instead be returned as a pointer to the heap, forcing you to use an allocator (granted, one could use an optimized thread-local allocator for it, or even back it by the stack and fall back to other means if needed), but that adds indirections.

  • @berestovskyy
    @berestovskyy 5 years ago +1

    PREFETCHNTA actually pulls data into the L1 and L3 caches on modern Xeons, since L3 is the point of coherency...

    • @GorNishanov
      @GorNishanov 5 years ago +1

      Thank you! Good to know.

    • @movax20h
      @movax20h 4 years ago

      I think the main point of NTA is to not invalidate the cache line in other cores. It is already close by, but is being written or used by another core, so it's better to leave it there until the last moment you actually want to write into it (or read from it). You also get interactions of prefetch NTA with NT loads and stores, but that is rather orthogonal.

  • @telishevalexey2283
    @telishevalexey2283 5 years ago +1

    An interesting thing is that even if you remove the prefetch instruction, the coroutine (and the handwritten version) is still faster than the naive one. Gor, do you have any idea why that is happening? Is it because the OOO engine has more instructions to work with?

    • @GorNishanov
      @GorNishanov 5 years ago +1

      Very interesting observation! I just checked, and indeed, on the machine I used for the measurements, removing the prefetch slowed the coroutine version down from 9ns to 17ns per lookup, but it was still faster than the 30ns of the naive implementation. I don't know the reason; your guess seems plausible to me. If there are experts on out-of-order execution on modern hardware lurking around, your input is appreciated. It is quite amazing that the hardware is able to pull it off!

    • @gearg100
      @gearg100 5 years ago +3

      @@GorNishanov In rough terms, OOO execution hides latency as long as it can find independent instructions among the ~200 that follow the blocking one. In the case of pointer chasing, most of the instructions that follow a memory access depend on it, and this lack of independent work is why the processor stalls. With interleaved execution, we introduce independent instructions between a load and the next instruction that depends on it. The OOO engine can execute the added instructions while waiting for the first load to complete. However, that load blocks the pipeline, preventing all the instructions that follow it from retiring. This means the processor can continue execution until its instruction window becomes full (after ~200 instructions on modern processors), at which point it has to wait until the load finishes. In contrast, a prefetch does not block the pipeline, allowing any number of instructions to run after it, hence the performance improvement.

  • @xbreak64
    @xbreak64 5 years ago

    Source code: github.com/GorNishanov/await/tree/master/2018_CppCon

  • @Omnifarious0
    @Omnifarious0 5 years ago +1

    He talked about a GitHub repository of his code in the talk. Where is it?

    • @b14ckj4ck
      @b14ckj4ck 5 years ago +1

      Probably here: github.com/GorNishanov/await/tree/master/2018_CppCon
      Unfortunately, the code has not been added yet.

    • @GorNishanov
      @GorNishanov 5 years ago +1

      The samples will appear here this week: github.com/GorNishanov/await/tree/master/2018_CppCon

  • @RushPL1
    @RushPL1 5 years ago +4

    Node.js-like AIO code in C++ - voting YES!

  • @YourCRTube
    @YourCRTube 5 years ago +2

    As much as I am excited about coroutines, the extension points still make me uncomfortable; they will be among the most complex things to extend, comparable to writing an allocator, for instance. How many people will do it? How many will do it right? Hopefully things will improve once at least some pieces are ready-made, like 'task'. Concepts will probably also make things better, as the interface will become more visible.

    • @GorNishanov
      @GorNishanov 5 years ago +3

      They make me uncomfortable too :-). What I hope will mitigate the complexity is that it is layered. Say, 2 million C++ developers will use just the top-level syntax, with tasks and generators provided by the standard or by high-quality open-source libraries (essentially the C# or Python experience). 10,000 developers will need to understand the concept of an awaitable, to extend awaiting to the domain-specific APIs they work with. Fewer than 1,000 developers will need to know how to define new coroutine types.

    • @YourCRTube
      @YourCRTube 5 years ago +1

      @@GorNishanov Hello, and thanks for the reply. In the last week I took the time to read/watch everything about coroutines there is to read/watch, I believe literally all of it. My conclusion is that they are *not complex*, nor complicated, and every developer should be able to create new types using the base building blocks.
      That being said, I believe the issue people have is exclusively due to confusing naming. This is evident from the fact that all tutorials have to add comments explaining what the code is doing and what it means. In other words, I believe there could be a massive improvement through better naming.
      Some examples: await_ready reads like a signal/callback - "it will be called when the coroutine is ready" (whatever that means, but probably resumed) - which is of course not the case. await_suspend reads like an action - "perform suspend by implementing this function". On the promise side, initial_suspend again sounds like an action ("do the initial suspend"), and no one could guess it returns an object that matters, not to mention that the object type alone is completely new and odd enough for a newcomer. The name promise is also a source of confusion: every teacher has to note that it is not related to std::promise, and possibly no teacher could explain why it is named that way.
      I really hope the design space is still open, because I believe the issue is more than just bikeshedding. Thank you

  • @RoyBellingan
    @RoyBellingan 5 years ago

    Now I'm thinking with coroutines!

  • @Voy2378
    @Voy2378 5 years ago +3

    Gor, if you are reading this, please kill co_await and use the Google proposal syntax... I would prefer await, but anything is better than co_await.

    • @GorNishanov
      @GorNishanov 5 years ago +3

      Yes, I am reading this. Here is one of my earlier attempts to fix that: open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0071r0.html . It went down in flames.

    • @Voy2378
      @Voy2378 5 years ago

      That sucks... thank you for reading the comment :) BTW, I have another question: am I right that just writing a batch function that does 16 (or 20, or 8...) lookups without coroutines would be equally effective, because the compiler would move the loads before the computation? You would get the same results, and maybe even faster, because no coroutine switches happen.

    • @GorNishanov
      @GorNishanov 5 years ago +1

      Wasn't the hand-crafted state machine built in the first part of the nano-coroutines presentation an example of such batching? It ended up being slightly slower than the coroutine, but it is possible that by playing more with the non-coroutine version you can trick the optimizer into making it as efficient as (or even more efficient than) the coroutine version.

    • @Voy2378
      @Voy2378 5 years ago

      Well, you do not need a state machine, so I think it may be faster... If you share your code on GitHub, I may even be bored enough to try it myself. :)

  • @Sopel997
    @Sopel997 5 years ago

    1:03:00 You said that with only one search you can't do better, but wouldn't it be possible to split the search space and treat it like multiple searches? I was thinking of going sequentially through the first part (above the 'red squiggly line') of the binary search function for each search section (ending on a prefetch), and then sequentially through the lower part (keeping separate state for each search section, like with coroutines, but not actually requiring them). This would even completely eliminate the overhead of switching them, right (or is it comparable to incrementing a pointer in the optimized code)?
    Also, it doesn't look like the presentation is on GitHub?

    • @nullplan01
      @nullplan01 5 years ago +2

      That does not help you with binary search. If you split the search space in half and then do a binary search in each half independently, one of your halves is being searched pointlessly, and you will see which one after a single step of the algorithm.

  • @Inityx
    @Inityx 5 years ago +5

    He keeps saying that C++ coroutines are fundamentally different from other languages; aren't Rust's coroutines almost exactly the same as C++'s?

    • @BillyONeal
      @BillyONeal 5 years ago +3

      I seem to recall a statement that Rust coroutines actually use the same infrastructure that was built into the LLVM IR (by Gor) to support C++ coroutines.
      The main point here is that stackful coroutines / 'fibers' are not a good path forward.

    • @jimlin897
      @jimlin897 5 years ago +7

      As far as I know, C++20 coroutines are stackless, which means they are almost a pure syntax transformation. The only extra cost is the heap allocation, and I believe that can be solved well with preallocated memory. Switching to a coroutine should cost no more than 2 ops: one to set a pointer, one to jmp.

    • @cgoobes
      @cgoobes 5 years ago

      @@BillyONeal Fibers seem pretty awesome, though. Are you saying they just shouldn't be part of what a "coroutine" is going forward, or that they shouldn't be in the standard at all?

    • @BillyONeal
      @BillyONeal 5 years ago

      @@cgoobes Yes, I think fibers are worse than useless; anywhere you'd use a fiber, I'd rather see a thread instead.

    • @cgoobes
      @cgoobes 5 years ago +2

      @@BillyONeal Why? User-mode context switching is FAR faster than the OS-level context switching that threads require. And most async applications don't need threads; they just need to switch in long enough to hand their I/O operation to the operating system or to delegate the returned data to a handler in a real thread pool. Fibers work perfectly as the async I/O "management" system.

  • @OperationDarkside
    @OperationDarkside 5 years ago

    Looks awesome, but it's not so easy to follow for a beginner.

  • @andy.1331
    @andy.1331 5 years ago

    I've been working with the Concurrency Runtime part of the Microsoft SDK for a long time. Microsoft released support for easy migration from the Concurrency Runtime to coroutines; that's how I got the chance to experiment with coroutines using my existing concurrency-based code. There is a big, fundamental problem with using co_await: the runtime actually leaves your current procedure right where it meets co_await and calls destructors for all in-scope objects. Then it comes back (once the co_await is ready) to the line right after the co_await operator. At that point, all stack objects declared before the co_await are broken. As a result, we break the scope-based object-lifetime paradigm, and object lifetime is a cornerstone of C++. That's why await techniques work best in "managed" languages with garbage collectors, where scope and stack are isolated from the programmer. In C++, to be safe with all the in-scope objects we use before and after a co_await, we have to use smart pointers. But that's not obvious, it's costly, and it's certainly a source of runtime trouble. That's why I'm against co_await in C++. Being enthusiastic, I even tried to use coroutines in my projects, but... the result was absolutely unacceptable. With task-based concurrency I really do control object lifetimes, because the concurrency is based on lambdas.
    I'm just curious why Gor never talks about this issue ;-)

    • @llothar68
      @llothar68 4 years ago +1

      Then that is a compiler bug. Every stack object alive across a co_await will be put into the coroutine frame object. That's the first half of the magic of coroutines.

  • @TheEmT33
    @TheEmT33 5 years ago +4

    I love the Russian accent.