Let me save you an hour: #embed (which is basically a binary #include to embed data from files into static arrays at compile time) is proposed, but not happening yet. The fastest way to read data from the hard drive is still mapping it to avoid buffered reads, and there's still no standardized solution for that in C++, although there's a bunch of proposals submitted (std::map_file is still not happening). On Windows 11 with an RTX 3090 you have DirectStorage, which allows direct transfer of file data into VRAM, bypassing RAM/CPU.
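For context, this is roughly what the proposed directive looks like in use - a sketch of the proposal as described, not standard C++ at the time of the talk; the file name is illustrative:

```cpp
#include <cstddef>

// The preprocessor expands #embed into a comma-separated list of byte values,
// so the whole file ends up baked into the array at compile time.
static const unsigned char icon_png[] = {
#embed "icon.png"
};

static const std::size_t icon_png_size = sizeof icon_png;
```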
This is by far the worst CppCon talk I have ever seen. His information about DirectStorage is also wrong: it won't bypass RAM/CPU, it just bypasses the majority of the traditional IO stack on Windows.
So what are buffered reads, and what's wrong with them?
@@billgrant7262 buffered reads copy data into a temporary buffer before giving it to you, which allows easier logistics but copying to an intermediate buffer has a cost associated with it.
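Roughly, the difference looks like this - a POSIX-flavoured sketch, with the file name and sizes purely illustrative:

```cpp
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

int main() {
    char dst[4096];

    // Buffered: stdio first reads into its own internal buffer,
    // then copies the requested bytes into dst (an extra memcpy).
    if (std::FILE* fp = std::fopen("data.bin", "rb")) {
        std::fread(dst, 1, sizeof dst, fp);
        std::fclose(fp);
    }

    // Unbuffered: the kernel deposits the bytes directly into dst.
    int fd = open("data.bin", O_RDONLY);
    if (fd >= 0) {
        read(fd, dst, sizeof dst);
        close(fd);
    }
}
```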
The power suit of the software development world (beach shirt) makes a glorious comeback. Quality talk is guaranteed.
"I should have taken it as a sign of things to come when, in 2015, Jon Kalb tried (emphasis on tried) to introduce me to Bjarne. Bjarne looked me up and down, put on a face of disgust, and walked away."
the talk starts at about 37:00
Phenomenal talk from Guy as always. And looking fabulous
Wish there were more solutions in the talk but interesting none the less. Essentially all the staggered steps I've run into as I've been trying to make one of my programs.
It's unbelievable that C++ still has no std::embed. This feature is very useful and easy to implement by compiler authors.
Now I need to implement my own embedding via an external script that generates a header file, which I later include.
Talking from ignorance here, but I think it's because most embedded devs prefer C to C++.
@@heavymetalmixer91 That argument doesn't make much sense. People have wanted to include stuff into executables pretty much forever, and C is only now getting this facility. C software tends to be super conservative with features since it often has to target remarkably odd and wonky toolchains. And C++ is exceptionally big in embedded since you can have true zero-cost abstractions - just no STL on microcontrollers, no RTTI, no exceptions, usually limited to no dynamic allocation, some oddities.
Depending on which toolchains you have to target, you can have a macro that emits inline assembler with two symbols, one being an incbin of the file and the next being an end marker; the difference between them is the length. So you can possibly have it all within your C++ code even though it's strictly speaking not implemented in C++. Haven't watched the talk, don't know if he mentions it.
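A sketch of that trick, assuming GCC/Clang with the GNU assembler's .incbin directive; the macro, symbol and file names are illustrative:

```cpp
#include <cstddef>

// Emit the file's bytes into .rodata between two global symbols.
#define EMBED_FILE(name, path)                              \
    __asm__(".section .rodata\n"                            \
            ".global " #name "_start\n" #name "_start:\n"   \
            ".incbin \"" path "\"\n"                         \
            ".global " #name "_end\n" #name "_end:\n"        \
            ".previous\n");                                  \
    extern "C" const unsigned char name##_start[];          \
    extern "C" const unsigned char name##_end[];

EMBED_FILE(blob, "asset.bin")

// The embedded size is simply the distance between the two symbols.
inline std::size_t blob_size() {
    return static_cast<std::size_t>(blob_end - blob_start);
}
```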
Lots of history in this talk leading up to a summary description of three papers, P1040, P1031 and P108?, all dealing with improvements to file I/O for C++ or the C standard. Could use some better details on the costs, but it's only an hour format.
43:16 I feel like reading a memory mapped file on a different thread is a hacky way to do it. Linux has the mmap system call option MAP_POPULATE. Granted this might not be portable, but at least your program doesn't need to make assumptions about how the page faults correlate with disk IO requests and performance should be similar if not better than doing it manually.
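A minimal Linux-only sketch of that option (error handling trimmed; the path is illustrative):

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a whole file read-only and ask the kernel to populate it eagerly.
void* map_whole_file(const char* path, std::size_t* len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    struct stat st{};
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }

    // MAP_POPULATE faults the pages in up front, so later reads don't
    // stall on per-page faults and the disk IO they trigger.
    void* p = mmap(nullptr, static_cast<std::size_t>(st.st_size), PROT_READ,
                   MAP_PRIVATE | MAP_POPULATE, fd, 0);
    close(fd);  // the mapping holds its own reference to the file
    if (p == MAP_FAILED) return nullptr;

    *len_out = static_cast<std::size_t>(st.st_size);
    return p;
}
```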
Excellent talk!
Very interesting talk
No, you are totally wrong about DMA and MMUs. The I/O driver creates an array of PRD structures (physical region descriptors), i.e. an array of {start, length, flags} entries that tell the device where the kernel has mapped that range of address space in physical memory. On top of that there is another layer called an IOMMU, which can push the physical addresses through yet another translation layer, so a virtualization platform can make a guest OS think it is dealing with physical addresses when those are transparently remapped. Almost every device uses DMA, but they call it "bus mastering" now. Same thing; bus mastering is just more general - you can do I/O port accesses too, and it might not be memory access at all, you can bus master MMIO ports. You can do "peer accesses", where one device accesses another: you could write an image straight into video RAM from an AHCI host controller. The CPU's RAM isn't necessarily involved. I have implemented many modern hardware drivers since my VIC-20.
Having listened to the talk, I couldn't find a place where he said anything differently? Could you be so kind to provide a time stamp?
Least hostile C++ developer
When I see a C++ developer with a Hawaiian shirt, I know I'm in for a good talk.
A person from the audience claims that on Windows there's currently no way to reserve the address range for a mapped file without actually allocating RAM corresponding to that address range. If I understand that comment correctly, it means that if you want to mmap a 32GB file on Windows, you'd have to allocate 32GB of RAM just to reserve its address space. The way it's supposed to work is that mmap should only reserve the address space corresponding to the file size, and then virtual memory is supposed to redirect all memory reads from the mapped range to the hard drive, which has to be implemented at the OS level (but the commenter claims on Windows it's not).
Can anybody confirm this or correct me?
I think the way it's supposed to work is what they were trying to describe, but using Windows terms. Like if you check Task Manager in the memory section, you should see "In use" vs "Committed", where in this case you would add 32GB to the committed section but not the in-use section.
What they were probably trying to say was that there isn't a way to map less of the file and that it *must* map the whole file to the address space.
The commenter is wrong. When a file mapping view is created, only the address space of the view size is allocated, without any committed physical storage. When any portion of the view is accessed by the CPU, the regular 'page fault' mechanism is triggered and the kernel maps physical memory to the accessed address-space pages with the content of the appropriate file section. The only difference from regular virtual memory is that the address space of the view is backed not by the common pagefile but by the file being mapped, so when the physical memory is paged out it can simply be discarded, and writing back to the disk is only needed if it was modified.
The whole mechanism is not much different from allocating address space with VirtualAlloc (flag MEM_RESERVE) and then allocating physical storage behind the individual pages with VirtualAlloc (MEM_COMMIT). File mapping and regular memory allocation work the same way, it's just completely automatic.
What the commenter is probably thinking of is that the address space is always allocated, which can be very large for certain file mappings. If you map a large file portion but only use a small part of it, address space is needlessly wasted, which can lead to out-of-memory conditions when allocating plain memory later in the application. This was indeed a problem for 32-bit applications with their 2GB address space, but it shouldn't be for 64-bit applications.
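A minimal Win32 sketch of what that looks like (error handling trimmed; the function name and path are illustrative) - MapViewOfFile only reserves the view's address range, and physical pages are faulted in lazily as the view is touched:

```cpp
#include <windows.h>

// Map an entire file read-only. RAM is only used page by page on first access.
const unsigned char* map_readonly(const wchar_t* path, SIZE_T* size_out) {
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return nullptr;

    LARGE_INTEGER size{};
    GetFileSizeEx(file, &size);

    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    CloseHandle(file);                      // the mapping keeps the file alive
    if (!mapping) return nullptr;

    // Map the whole file; this reserves address space, it does not commit RAM.
    const void* view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    CloseHandle(mapping);                   // the view keeps the mapping alive
    *size_out = static_cast<SIZE_T>(size.QuadPart);
    return static_cast<const unsigned char*>(view);
}
```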
Oh, "what's a file?" - that was my first question in 1990, when I first started learning computer programming in a club, with DOS and GW-BASIC.
I do understand that doing file I/O in a way that made sense on 1970s hardware does not necessarily make sense on present-day hardware. But I wonder if this can't be solved under the hood. I mean, I don't expect that the number 4k for the buffer size is in the standard; that could be increased. And does the standard say anything that prevents an implementation from querying the file size and then setting the buffer to that number (with some cap for really huge files)? So that in 90% of all cases there is only 1 chunk and the buffering is purely virtual? You can be a bit creative, can't you? And I expect that the iostreams library is specified in quite abstract terms. I don't expect that it prescribes being implemented in terms of fopen and friends. It could also be implemented in terms of that fancy OS API he was talking about. If you really want to be nice to people running ancient legacy software on ancient legacy hardware, or Arduino and other embedded hardware, there could be an opt-out through a compiler option that selects the ancient implementation.
So I wonder, isn't he barking up the wrong tree? Is it really necessary to push the committee for new features and could he not just as well lobby with standard library implementers to reimplement the existing interface but optimised for modern hardware?
The nice thing about C++ is that it is so generic that it knows it needs to allow you to adjust the buffer size of your I/O stream. The annoying thing about C++ is that at their defaults fopen, fstream, etc. are slower than Python's open because they do different things.
By modifying the buffer size of my fstream to 8MB I was able to read through the file ~20% faster. But it took a week of research to figure out that this is a valid method used by several other people. And that it's still not the fastest solution. The fastest would be to mmap the file and pass those locations to my threads instead of reading through the whole thing in order to do essentially the same segmentation.
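For anyone curious, enlarging the stream buffer looks roughly like this - a sketch; the 8 MB figure is just the one from the comment, and pubsetbuf has to be called before the file is opened on common implementations:

```cpp
#include <fstream>
#include <string>
#include <vector>

int main() {
    std::vector<char> buf(8 * 1024 * 1024);        // 8 MB user-supplied buffer

    std::ifstream in;
    // Replace the default stream buffer before open() so it actually takes effect.
    in.rdbuf()->pubsetbuf(buf.data(), static_cast<std::streamsize>(buf.size()));
    in.open("big.dat", std::ios::binary);

    // Each refill is now one large read instead of many small ones.
    std::string line;
    while (std::getline(in, line)) { /* parse line */ }
}
```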
The filesystem library is required to get the size of the file unless you rely on the OS. If you rely on the OS, you are making a separate file-management program for each system.
There are other tricks, like directly moving the buffer data into the destination char[] instead of copying from the buffer to the char[], or creating a new string each time you create a buffer so that you can just switch the pointer over to another pointer.
All of that to get close to Python's default behaviour but never actually attain it... Because Python is actually mapping the file ahead of time.
But it also means that if your data size is consistent you can set the chunk to the expected size and not need to parse backwards. I guess?
So why not mmap the files? (Unbuffered reading is, I think, what this corresponds to.) This lets your program know where to place the pointer without 'reading' the data into a buffer, allowing you to get to any point of the file extremely quickly. But as he says at the end of the talk, there is no standard pre-made way to do so. I've found a library called Boost which does this, or you will need to use the mapping method provided by the OS you're working with. Except, hahah, Windows doesn't have mmap itself (it has its own file-mapping API instead).
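If Boost is on the table, the mapped-file part looks roughly like this - a sketch using Boost.Iostreams; the file name is illustrative:

```cpp
#include <boost/iostreams/device/mapped_file.hpp>
#include <cstddef>

int main() {
    // Maps the whole file read-only; the OS pages it in on demand.
    boost::iostreams::mapped_file_source file("big.dat");

    const char* data = file.data();    // pointer straight into the mapping
    std::size_t size = file.size();

    // Hand (data, size) slices to worker threads and parse in place,
    // with no explicit read() into an intermediate buffer.
    (void)data; (void)size;
}
```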
This also doesn't solve the situations where you KNOW you want some specific data when you start some program. Which is where #embed comes into play, essentially allowing you to automatically ship the data with the binary that will use it without needing to run it through an I/O stream.
FAT32 has a strong use case: SD-Card connected to a microcontroller. If you have a 3D printer, or a data logger, or one of many kinds of gadgets of this type, they're going to pretty much need to use FAT32. You MIGHT implement exFAT these days but just a couple years ago it wasn't at all an option for intellectual property reasons, so it's still a little rare.
Nice talk. Standardized mmap would be rad!
I work for DMG Mori, so I got a chuckle at 49:18 😅
I couldn't sleep again, here we go!
Attacking C stdio with C++ streams is not wise.
please give me my hour back
RAM is not just another kind of cache. All caches exist to speed up access to RAM. You can't address the cache, you can only address RAM. Only RAM matters.
Not all addressable memory is random access, and RAM is often used as a cache for yet deeper layers of data. Nor are instructions required to be addressable in RAM or cache.
On the other hand, there was a time when main memory was called "storage" and the only persistent records were made on paper (printing, automatically punched cards and punched tape).
I don't think he meant it literally is a cache, like L1 cache, L2, etc., but rather the abstract meaning where it's just a cache of data from, say, the hard disk. Which makes more sense in the context of virtual memory, where unused memory can be paged out to the disk, which is much slower than RAM.
Your memory hierarchy, not from the machine architecture perspective but from the data perspective, could be registers - L1 L2 L3 - DRAM - disk - network. At this very moment, you're interacting with files on disk which cache Internet requests. And those in turn run through a CDN which can be seen as another set of layers in the cache hierarchy, since you're getting local copies from a server within 2000km of you of distributed worldwide data.