Google Patches Linux Kernel for a 40% TCP Performance Gain

  • Published Jul 8, 2024
  • Google submitted a patch to Linux kernel 6.8 that improves TCP performance by up to 40%. It does this by rearranging the TCP structures for better CPU cache-line utilization. I explore the change here.
    0:00 Intro
    0:30 Google improves Linux Kernel TCP by 40%
    1:40 How CPU Cache Line Works
    6:45 Reviewing the Google Patch
    www.phoronix.com/news/Linux-6...
    lore.kernel.org/netdev/202311...
    Discovering Backend Bottlenecks: Unlocking Peak Performance
    performance.husseinnasser.com
    Fundamentals of Backend Engineering Design patterns udemy course (link redirects to udemy with coupon)
    backend.husseinnasser.com
    Fundamentals of Networking for Effective Backends udemy course (link redirects to udemy with coupon)
    network.husseinnasser.com
    Fundamentals of Database Engineering udemy course (link redirects to udemy with coupon)
    database.husseinnasser.com
    Follow me on Medium
    / membership
    Introduction to NGINX (link redirects to udemy with coupon)
    nginx.husseinnasser.com
    Python on the Backend (link redirects to udemy with coupon)
    python.husseinnasser.com
    Become a Member on YouTube
    / @hnasr
    Buy me a coffee if you liked this
    www.buymeacoffee.com/hnasr
    Arabic Software Engineering Channel
    / @husseinnasser
    🔥 Members Only Content
    • Members-only videos
    🏭 Backend Engineering Videos in Order
    backend.husseinnasser.com
    💾 Database Engineering Videos
    • Database Engineering
    🎙️Listen to the Backend Engineering Podcast
    husseinnasser.com/podcast
    Gears and tools used on the Channel (affiliates)
    🖼️ Slides and Thumbnail Design
    Canva
    partner.canva.com/c/2766475/6...
    Stay Awesome,
    Hussein
  • Science & Technology

Comments • 434

  • @hnasr
    @hnasr  4 months ago +26

    Get my Udemy Fundamentals of Backend Engineering course: backend.win

    • @Rovsau
      @Rovsau 4 months ago

      Kudos for a good explanation. Today, I learned!

    • @santapreese
      @santapreese 4 months ago

      Hello! Thanks for the insightful video :) Is it possible for you to publish your content on LinkedIn Learning? Some organizations have a subscription for that instead of Udemy.

    • @TheSulross
      @TheSulross 4 months ago

      Another tactic that could potentially be used by the Linux kernel to speed up the network stack would be to pin one or more CPU cores to exclusive use for packet pipeline processing. Pinning would mean taking said core away from the kernel scheduler as a resource to be managed during context switching of thread execution.
      This is a very well established tactic used by the Intel Data Plane Dev Kit (DPDK) and is referred to as Polling Mode Driver (PMD) approach. Such pinned cores execute code at 100% CPU cycle time, never block on anything - nor are they preempted by the kernel. In DPDK terminology, a pinned CPU core is referred to as an lcore and there is a DPDK API to execute C functions on said lcores. (Modern CPUs have so many cores that it warrants pinning some cores for specialized processing.)
      So I have a DPDK program where I can configure how many lcores to dedicate to an lcore worker pool. I'm using C++17 for this program. Various pipeline operations get established as constructed callable objects, which means that if I have a callable object stack variable called action, I can just invoke it as action(), or just treat it as a piece of aggregate data.
      Said callable object supports C++ move semantics. If the program is configured to execute the entire application on just a single lcore, then it just immediately invokes the callable object in that lcore's context. If there are multiple lcores, then the action is C++-moved into a lock-free work queue. All of the lcores in the worker pool are in an endless loop of retrieving a callable object from the work queue and then just invoking it. Each lcore is assigned a unique token. The callable objects are published using this set of tokens - the publisher just ring-cycles through the set of tokens as it publishes work items. The consuming lcores retrieve work items as identified by their respective token. This averts contention between lcores rendezvousing on the work queue and provides fair, round-robin, load-balanced processing.
      This kind of network programming greatly exceeds what can be accomplished by writing conventional networking applications that rely only on the kernel network stack and conventional multi-threaded concurrent programming. And such applications run entirely in user space - never touching the Linux network stack. So the high performance data plane pipelines never block, never transition to kernel mode, and never get preempted by the kernel scheduler.
      And naturally data structures are tuned around the cache line size.
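
      For readers who haven't used DPDK: a minimal sketch of the lcore-pinning pattern described above, using DPDK's C API (20.11+); the packet-processing body is left as a comment:

      #include <rte_eal.h>
      #include <rte_launch.h>
      #include <rte_lcore.h>

      /* Busy-poll worker with the lcore_function_t signature; it never blocks,
         never sleeps, and is never preempted by the kernel scheduler. */
      static int worker(void *arg)
      {
          (void)arg;
          for (;;) {
              /* poll an RX queue, run pipeline actions ... */
          }
          return 0; /* unreachable */
      }

      int main(int argc, char **argv)
      {
          if (rte_eal_init(argc, argv) < 0) /* lcores pinned per the -l/--lcores flags */
              return 1;

          unsigned lcore_id;
          RTE_LCORE_FOREACH_WORKER(lcore_id) /* every lcore except the main one */
              rte_eal_remote_launch(worker, NULL, lcore_id);

          rte_eal_mp_wait_lcore(); /* the main lcore waits on the workers */
          return 0;
      }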

    • @treyquattro
      @treyquattro 3 months ago

      ah, clickbait

  • @solarsunny
    @solarsunny 4 months ago +449

    Corollary: some wrong variable placements will give you 30% performance degradation.

    • @kartik4792
      @kartik4792 4 months ago +36

      It's not about variable placement but organising data in memory.

    • @earthling_parth
      @earthling_parth 4 months ago +21

      @@kartik4792 Isn't the variable placement what determines how the data is organized in memory?

    • @bladman9700
      @bladman9700 4 months ago +39

      @@kartik4792 Yes. Organizing variables is how you organize the memory. Genius.

    • @absalomdraconis
      @absalomdraconis 4 months ago +10

      ​@@kartik4792 : Trying to distinguish variable placement and memory placement as different things will sometimes cause massive misunderstandings. Your pedantic attempt at such a distinction is bad and you should stop.

    • @elpapichulo4046
      @elpapichulo4046 4 months ago +4

      @@absalomdraconis I don't think it's that pedantic; IIRC the compiler can decide where on the stack a variable will be placed, so even if two variables are declared one after the other, there's no guarantee they're laid out next to each other in memory.

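      Struct fields are a different story from stack locals, though: C pins them to declaration order, and the compiler may only insert padding between them, which is what the patch exploits. A small sketch that makes the layout visible (hypothetical field names; offsets are for a typical x86-64 ABI):

      #include <stdio.h>
      #include <stddef.h>

      struct conn {
          char  state; /* offset 0, then 7 bytes of padding */
          void *buf;   /* offset 8: pointers need 8-byte alignment */
          char  flags; /* offset 16, then 7 bytes of tail padding */
      };

      int main(void)
      {
          printf("state=%zu buf=%zu flags=%zu size=%zu\n",
                 offsetof(struct conn, state), offsetof(struct conn, buf),
                 offsetof(struct conn, flags), sizeof(struct conn));
          return 0; /* prints: state=0 buf=8 flags=16 size=24 */
      }
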
  • @ThePapanoob
    @ThePapanoob 4 months ago +288

    There are multiple things that Coco did with "just moving around variables" that helped performance that you didn't even realize. :D
    When data is fetched and a cache miss happens, the CPU pulls in a whole cache line (and the page translation lands in the TLB). Reorganizing the struct so that neighbouring variables are contextually dependent on each other means fewer round trips to memory. But the other thing Coco did was organize the variables so that the compiler doesn't pad them as much. Compilers align each field to its natural size, so if you scatter smaller variables (u8) between bigger ones (u32) you get a lot of padding, resulting in a much bigger struct. That in turn means more chances of the data you need not fitting in one cache line, resulting in more memory reads.
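
    The padding half of that is easy to see for yourself (toy structs, not the kernel's; sizes are for a typical x86-64 ABI):

    #include <stdio.h>
    #include <stdint.h>

    struct bad  { uint8_t a; uint32_t b; uint8_t c; uint32_t d; }; /* 3 pad bytes after a and after c */
    struct good { uint32_t b; uint32_t d; uint8_t a; uint8_t c; }; /* widest fields first */

    int main(void)
    {
        /* typically prints: bad=16 good=12 */
        printf("bad=%zu good=%zu\n", sizeof(struct bad), sizeof(struct good));
        return 0;
    }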

    • @lab-at-home
      @lab-at-home 4 months ago +17

      Ram bus go vrum vrum?

    • @petter9078
      @petter9078 4 months ago +5

      How do you even approach such a thing? How do you analyse it 🤔

    • @ThePapanoob
      @ThePapanoob 4 months ago

      @@petter9078 Trial & error… 😅 and loads of hours in Ghidra & the Godbolt compiler explorer

    • @davidmartensson273
      @davidmartensson273 4 months ago +20

      @@petter9078 If you know about the problem, just looking at a struct will be enough to see possible "holes".
      I never did anything on this level, but back in the MS-DOS days you reorganized the order drivers were loaded in, because the OS started at the top and fetched the next free area the driver would fit into, but it never backtracked to check areas already skipped. So by rearranging the driver order you could get it to put a smaller driver in a space that otherwise got left empty.
      That way you could minimize the amount of memory below the 640 KB limit that the actual programs had access to, and that could be the difference between being able to run some games or not.
      Similar concept.

    • @mss664
      @mss664 4 months ago +6

      @@petter9078 Processors expose performance counters. It ain't perfect, but you can see roughly where the cache misses are.

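      On Linux that usually means perf; a sketch of common invocations (event names vary by CPU):

      # count cache behaviour for a whole run
      perf stat -e cache-references,cache-misses ./your_program

      # attribute misses to code locations
      perf record -e cache-misses ./your_program
      perf report
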
  • @tibbydudeza
    @tibbydudeza 4 months ago +354

    A standard C optimization technique back in the old days when you cared about performance: aligning your data structs, via padding and bit fields, to how the CPU fetches data from memory and fills its caches.
    Programmers nowadays don't care about it.
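
    The modern C spelling of that old-school technique looks something like this (a sketch; C11, 64-byte cache line assumed, field names hypothetical):

    #include <stdalign.h>
    #include <stdint.h>

    struct pkt_stats {
        alignas(64) uint64_t rx_bytes; /* start the struct on a cache-line boundary */
        uint64_t rx_packets;
        uint8_t  pad[48];              /* hand-padded out to exactly one line */
    };
    _Static_assert(sizeof(struct pkt_stats) == 64, "must be one cache line");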

    • @austinscott4695
      @austinscott4695 4 months ago +10

      wow, where can I read more about this?

    • @ErazerPT
      @ErazerPT 4 months ago +96

      It's not that you don't care, it's that for the most part you don't have access to it. If you're not doing ASM, C, C++, similar, chances are you DON'T have low level access to ensure optimizations actually work.

    • @zoenagy9458
      @zoenagy9458 4 months ago +21

      Rule 1: never optimize

    • @tibbydudeza
      @tibbydudeza 4 months ago

      It has been many decades since I did it.
      This was usually when you dealt with CPUs that wanted word-aligned data structs and could not handle unaligned data fetches, like the Motorola 68000, which would raise an exception.
      You would do it by hand in your structs and use unions to overlay them onto 16- or 32-bit words.
      Later your C compiler would automatically pad your C structs/unions to make sure they fell on word-aligned boundaries.
      The Intel memory controller was just smarter about it and did more than one memory fetch, impacting performance.
      @@austinscott4695

    • @tibbydudeza
      @tibbydudeza 4 months ago

      Not many of us write kernel drivers or OS code.
      The game industry cares about performance. Have you ever tried writing code for a PS3 with its oddball architecture?
      Believe me, you're going to be staring at profilers, at how fast the GPU is being fed with data, at whether the CPU is being kept busy, and at how much memory is being consumed.
      @@zoenagy9458

  • @youpapai
    @youpapai 4 months ago +130

    I miss systems programming; this was standard when writing drivers/kernel modules, etc.
    High-level programming, fast CPUs and large caches have spoiled everyone.

    • @ssl3546
      @ssl3546 4 months ago +27

      More like the Linux kernel is so dang big now that optimization projects like this get overlooked. When Linux was running on 8MB systems (with X!) much more attention was paid to small details.

    • @avalagum7957
      @avalagum7957 4 months ago +1

      Can the compiler help at all here?

    • @jfbeam
      @jfbeam 4 months ago +6

      @@ssl3546 There shouldn't be any "optimization projects". If people paid attention to the efficiency of their code _when they first f'ing write it,_ no one would ever need to come back later to rewrite it. But that takes intelligence and time; things modern programmers don't have. (Make things that can be sold, and get them to market as fast as possible. Quality isn't necessary.)

    • @honkhonk8009
      @honkhonk8009 4 months ago

      It's partially because everyone thinks it's like JavaScript or Python, and doesn't realize that C is quite simple and straightforward.

    • @cheebadigga4092
      @cheebadigga4092 4 months ago +1

      it's always features over optimizations, the latter can come later at any point in time, just like this patch

  • @johnswanson217
    @johnswanson217 4 months ago +96

    Chill, guys... It's not about superiority, it's about productivity. Google does this because it can reduce their server cost significantly, not just because they're capable of doing so. There are tons of micro-optimizations left to be done out there, but nobody cares because they don't make money. It's that simple. Follow money, not your ego.

    • @kounelios13
      @kounelios13 4 months ago +8

      No one said otherwise. Of course, at the end of the day this PR benefits Google in more than one way.

    • @hanifa205
      @hanifa205 4 months ago +4

      I agree with you. People will pay attention to details when they can see $$ in front of their eyes.

    • @chetan_naik
      @chetan_naik 4 months ago +2

      Is the improvement for both ends of the connection? Meaning on both client and server?

    • @TimeLemur6
      @TimeLemur6 4 months ago

      Without looking into it beyond this video, I'd say it'll benefit both. That said, this type of optimization is dependent on scale, so the client will likely see negligible benefit unless they have a large number of connections open simultaneously.

    • @顔boom
      @顔boom 4 months ago

      @@chetan_naik The improvement is measured in CPU seconds spent divided by the duration of a TCP round trip, so there's no real data in the patch notes on whether this helped with the latency portion.
      Basically, it depends on whether the bottleneck is the CPU side or the NIC side.

  • @shapelessed
    @shapelessed 4 months ago +104

    So basically all they did was move things around to maximise the amount of related data in the structs that, placed sequentially, would fit inside a 64-byte chunk of memory...
    Let's say I need to handle a request, and I need the variables X and Y, but there are 80 bytes of other stuff between them. Now the CPU needs to fetch those two variables in two separate requests. If X and Y were next to each other, they would most likely fit inside one 64-byte block of memory, which would then get cached. When we're finished with X, we don't need to re-request Y separately, because it has already been cached from that 64-byte chunk we requested previously. We just cut our memory access time by 50%.
    I didn't think about these sorts of things originally, but then I decided I'm a masochist and wrote a virtual filesystem with native AES encryption. I really learned a lot about memory when designing it, especially since AES works on 64-bit/8B chunks, so everything, every sector and block, had to be structured perfectly...
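
    The X-and-Y scenario as a toy C sketch (hypothetical fields, 64-byte lines assumed):

    /* Before: x and y end up on different cache lines because of the cold
       data between them, so using both costs two memory fetches. */
    struct conn_before { int x; char cold[80]; int y; };

    /* After: the two hot fields share one line; the first read of x
       pulls y into cache for free. */
    struct conn_after  { int x; int y; char cold[80]; };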

    • @amalragc
      @amalragc 4 months ago +2

      You explained it very clearly and in a very simple way. Thank you.

    • @rupertsmixtapes812
      @rupertsmixtapes812 4 months ago +1

      I thought there was only AES-128 and AES-256, with 128 bits being a block size of 16 bytes, not 8 bytes..?

    • @StupidusMaximusTheFirst
      @StupidusMaximusTheFirst 4 months ago +1

      Thanks for the explanation. By your "all they did", I assume this was simple to do or see? Then why hasn't this been done ages ago? A 40% improvement is no joke.

    • @shapelessed
      @shapelessed 4 months ago +2

      @@rupertsmixtapes812 Yeah, sorry, I forgot whether it was 8 or 16 bytes, because it was like 2 years ago and I haven't really needed to dive down into the details of any encryption algorithm since then. Thanks.

    • @shapelessed
      @shapelessed 4 months ago +1

      @@StupidusMaximusTheFirst I clicked the video because I was interested. I had heard about CPU caching and memory, and after the word "structs" I suddenly realised: "Hell, they only describe the type and order of data inside memory. I bet they just moved things around so that the most-needed props are closer together and faster to read."
      And as described in the video: most people these days use, or start out their journey using, high-level languages that don't give a damn about the order of data, much like JavaScript and TypeScript's structs/interfaces, which only serve you at compile time. ThePrimeTime recently posted a video about how junior devs are "noobs", where they discussed how people have stopped learning what their code is actually doing on their machines, mostly because machines are so fast that people started focusing more on writing good-looking code that is quick to understand and work with, rather than squeezing out performance they will probably never need.
      The abundance of computing resources made these hyper-optimisations uneconomical; it's more profitable to roll out features than to refine existing ones.
      When you go to a store, do you waste your precious time finding a plastic bag somewhere in your home and bother bringing it with you, or do you just buy a new one every time? They are cheap and abundant...

  • @sammyj29
    @sammyj29 4 months ago +20

    It's just the way a struct packs variables into memory. Going back to the basics!

    • @heyhoe168
      @heyhoe168 4 months ago +1

      Yeah, but one should always check with current architecture specification. What worked for i386 might be wrong for i686.

  • @kelvinleung3956
    @kelvinleung3956 4 months ago +8

    Love that you share the latest changes and improvements in the industry with easy-to-understand explanations. Love the passion, and thank you for sharing! Could you please share more about the OS/memory/CPU-level stuff that happens when code is executed in future videos?

  • @susmitvengurlekar
    @susmitvengurlekar 4 months ago +5

    Superb, it's just like reading a page when it comes to databases, which is why a clustered table has its benefits. It's all about optimising for the next access.

  • @Twangaming
    @Twangaming 4 months ago

    Thanks for covering this!

  • @darlokt51
    @darlokt51 4 months ago +44

    AFAIK this is, as the data shows, an AMD-specific optimization. Intel uses a (compared to AMD) highly advanced prefetcher that, unlike normal hardware prefetchers, works not only on memory address accesses but also on the data values themselves. Because of this, a physical reordering of the data in memory does not significantly impact the accuracy of Intel's prefetcher.

    • @BattousaiHBr
      @BattousaiHBr 4 months ago +6

      Interesting. So that means there could be many other areas of the kernel that could be heavily improved on AMD, whereas Intel is mostly unaffected.

    • @darlokt51
      @darlokt51 4 months ago

      @@BattousaiHBr Yes. AMD's processors are overall a bit rough around the edges from a hardware perspective and brute-force oriented. One can optimize programs for the quirks of individual architectures (as is done in supercomputing, where you know your hardware and can easily optimise for it), but this generally comes with a regression on other architectures, which you should avoid. This change, for example, is simply better in general for the access patterns seen in production; even Intel profits, if less, and AMD more, due to the problems in its architecture. But because of this, the better long-term solution is smarter prefetchers and branch predictors, to improve performance in general: we all write shitty code, and it's better if you don't lose 40% performance to bad code or hardware architecture. Not every program can be optimised to this level; hell, most programming languages don't even offer this low-level memory management. To make all code faster, the platform it runs on must be smarter.

    • @ismbks
      @ismbks 4 months ago

      Where did you read about this?

    • @darlokt51
      @darlokt51 4 months ago +18

      @@ismbks It's related to my work. You can find information about the different prefetchers in CPUs in the manufacturers' security guides. Prefetchers and branch predictors can, if wrongly implemented, lead to security vulnerabilities; information about them is mostly only published when a security vulnerability is found, as they are responsible for a lot of the performance gains in current processor architectures.

    • @tacokoneko
      @tacokoneko 4 months ago +3

      @@darlokt51 are you prohibited from revealing more information by your anonymous employer

  • @BR-lx7py
    @BR-lx7py 4 months ago

    Love this! Best part is that it's trivially correct.

  • @ahmetyasarozer
    @ahmetyasarozer 4 months ago +8

    Thanks Hussein ❤

  • @michaelelkin9542
    @michaelelkin9542 4 months ago

    That is a very good description. Surprising that something so simple makes such a big difference. I would have expected this to be a compiler optimization, and I hope it will be in the future.

  • @mattmmilli8287
    @mattmmilli8287 4 months ago

    Very nice topic and presentation

  • @TheEndorfinized
    @TheEndorfinized 4 months ago

    Dig your observations. Absolutely, everyone should look out for the things we overlook.

  • @_sudipidus_
    @_sudipidus_ 4 months ago +3

    This is fascinating. Recently I was exploring how Golang pads struct variables to align them to the 32/64-bit word length so that memory reads line up.
    It never occurred to me that it could bring this huge an improvement. I thought it was a micro-optimization, but I guess it builds up when it repeats (the networking stack and TCP connections definitely run at huge volume, and it's good to see a micro-optimization getting multiplied into a significant improvement).

  • @honkhonk8009
    @honkhonk8009 4 months ago +3

    What's great about C is the fact that it's so deterministic and simple.
    Structs? Oh, they're just basically arrays lol.
    Then it starts making sense how to optimize for the cache.

  • @meqdaddev4341
    @meqdaddev4341 4 months ago

    Impressive!
    How small and tiny things can have such big impacts!

  • @sweetshit4931
    @sweetshit4931 4 months ago

    Really insightful.

  • @Idlecodex
    @Idlecodex 4 months ago +6

    Hey Hussein, this is the classic case of keeping memory compact according to its temporal access. Maybe Intel's prefetching was already doing the job of keeping the cache hot, and that's why there was not much gain there. Another possibility is that the compiled code differs on the two architectures. Anyhow, I always hear that optimization is architecture-dependent; this seems to be the case.

  • @berndeckenfels
    @berndeckenfels 3 months ago +1

    It's not only about moving things next to each other, but also about moving things that are updated by different CPUs onto different cache lines, so they don't need to be invalidated on other sockets. This was done fairly regularly (besides having per-CPU structs).
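
    A sketch of that in C11 (hypothetical fields; the kernel's own spelling of this idea is the ____cacheline_aligned_in_smp annotation):

    #include <stdalign.h>

    /* Counters written by different CPUs get a 64-byte line each, so an
       update on one CPU doesn't invalidate the other CPU's line. */
    struct net_counters {
        alignas(64) unsigned long rx_packets; /* written on the RX path */
        alignas(64) unsigned long tx_packets; /* written on the TX path */
    };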

  • @marcosfons
    @marcosfons 4 months ago +10

    Well, some compilers do organize variable order in structs, but mostly based on alignment, to get efficient memory access. In this case, though, AOT compilers would have trouble, because this kind of optimization relies on the runtime behaviour of the problem. Consequently, optimizing this at the compiler level becomes more difficult, given that those compilers are generic.

    • @rogo7330
      @rogo7330 4 months ago +2

      It would break everything if code compiled with those sorts of optimizations called other functions that take that struct. It could work for structs that are only used by static functions in one file's scope, or with something like -fwhole-program (which basically makes everything static and exposes nothing outside the ELF file), but I never checked that.

    • @phillipsusi1791
      @phillipsusi1791 4 months ago

      In C, the compiler is required to place the variables in memory in the same order they were declared in the struct. It is allowed to pad them for alignment, but not to change the order.

  • @tesla1772
    @tesla1772 4 months ago

    Can't wait for the OS course

  • @myne00
    @myne00 4 months ago +1

    It's kinda crazy that optimisations this big are still possible after 30 years.

  • @rayc1557
    @rayc1557 3 months ago +1

    In the 1980s and all through the '90s, we did all our programming with this level of attention. Prior to then, I programmed mainly in assembly language, which required extreme attention to every instruction in the listing. BTW: the fact that Intel processors saw only a 1 to 5% increase could be attributed to Intel possibly having a better caching structure and algorithm to start with.

  • @devenparab
    @devenparab 2 months ago

    Such a beautiful explanation, Nasser; it feels like listening to a song

  • @BlurryBit
    @BlurryBit 4 months ago +2

    ok wow 40%, you have my interest!!! 😮

  • @morthim
    @morthim 4 months ago +3

    'they just moved some variables around'
    Literally the job of a programmer.

  • @rafagd
    @rafagd 4 months ago +11

    The compiler can't auto reorder variables because it breaks ABI compatibility with stuff compiled against your lib

    • @anandsharma7430
      @anandsharma7430 4 months ago +1

      This aspect should be given more importance in the discussion. I assumed that compilers already did these kinds of optimizations. There's also the question of whether network hardware around the world is uniform enough that these optimizations help on a majority of routers/switches/other network server hardware, enough to make it into the mainline kernel.

    • @rafagd
      @rafagd 4 months ago +3

      ​@@anandsharma7430 There's a very interesting document in the KDE dev guidelines, and one of the biggest no-no's is that you can't change the order or even add/remove non-static data members to a class that is in use (regardless of visibility), because it will force a break on the binary compatibility.

    • @laden6675
      @laden6675 2 months ago

      C is so old. Insane to think the performance improvements Linux could get if they fully switched to Rust, since it does reordering and this stuff is internal.

    • @rafagd
      @rafagd 2 months ago

      @@laden6675 That is not how it works. While the C language is old, its compilers are not. They are, in fact, built on the same LLVM backend that powers Rust. And the language is simple enough that loads of optimizations can be applied to it. C is still around for a reason.
      Of course, Rust CAN produce faster programs than C, but the inverse is often the case.
      Rust's edge here is that it's harder to leave a security hole in it than in C, and in most cases that comes at zero performance cost. Harder, but not impossible.
      Finally, Rust's ABI is famously not stable, so we'll still have to use the C ABI to interop with dynamic libs for a while. While the Linux kernel is monolithic, it still needs to expose system calls that will be called from C programs. So they would need to disallow the variable reordering anyway.

    • @laden6675
      @laden6675 2 months ago

      ​@@rafagd C compilers need to be standard compliant, and changing the order of fields from the one that's declared is non-standard.
      Even if they could, because of how C programs are structured where a program is built from a collection of object files that need to have stable ABI which are independently compiled, there is much less opportunity for optimization than in Rust.
      The vast majority of structs and functions are NOT exposed to the userspace, so this is a strawman.
      > C is still around for a reason.
      Inertia. It will take time, but over time there will be less and less C code in the software we use.

  • @matthewsheeran
    @matthewsheeran 4 months ago +3

    The cache line seems analogous to pages in the kernel page cache. Huge performance increases are par for the course: for higher-level programmers, things like iterators and lazy loading make similarly huge performance improvements, not to mention threading and multi-processing. I am guessing that you may have hit the nail on the head at the end: variable location may be compiler-dependent, but putting variables in a struct makes the compiler follow that exact order of definition.

  • @cheebadigga4092
    @cheebadigga4092 4 months ago

    Wow!! I have never once thought about how the alignment of C struct fields interacts directly with memory pages and the L3 cache. Simple but massively effective. Props to Google for that one!!

  • @first-thoughtgiver-of-will2456
    @first-thoughtgiver-of-will2456 3 months ago +1

    Rust automatically optimizes struct field layout, so your struct stays readable but doesn't require excessive padding.

  • @Sil3ntOne
    @Sil3ntOne 4 months ago +2

    Subscribed 🎉

  • @thinkingcitizen
    @thinkingcitizen 3 months ago

    I've done your Udemy course, I've watched 40-50 of your videos, and I just want you to know I frigging love you

  • @ac130kz
    @ac130kz 4 months ago

    you can also lower connection close timeout for essentially free performance

  • @AmjadKhateeb
    @AmjadKhateeb 4 months ago +3

    Nice explanation. On your question at the end about the compiler auto-optimizing the order: I don't think it would work. The struct is tied to many things that depend on the order, like serialization and saving data to binary files. If you serialize or save to a file in some order, then you have to deserialize in the same order.

    • @cameramaker
      @cameramaker 4 months ago +1

      I think the best would be to have an allocation hint to the cache-line boundary (64B currently) for every malloc/kmalloc call (or just align everything that way, lol)
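
      User space already has a version of that hint in C11 (sketch below); on the kernel side, slab caches created with the SLAB_HWCACHE_ALIGN flag give similar cache-line alignment.

      #include <stdlib.h>

      int main(void)
      {
          /* aligned_alloc: the size must be a multiple of the alignment */
          void *buf = aligned_alloc(64, 4096); /* starts on a cache-line boundary */
          if (buf == NULL)
              return 1;
          /* ... use buf ... */
          free(buf);
          return 0;
      }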

  • @zoenagy9458
    @zoenagy9458 4 months ago

    Excellent content!

  • @user-oj9iz4vb4q
    @user-oj9iz4vb4q 3 months ago

    Keeping struct fields in chronological order does have a technical reason. If you access a pointer to a new struct from code that expects the old struct definition, it still works: it's binary backwards-compatible. That doesn't work for arrays, but it does work for individual structs. This behaviour can be very useful in systems that need to persist and go through many iterations, like a kernel.
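
    A toy sketch of that append-only compatibility (hypothetical structs; strictly guaranteed by C only in limited cases, but it is how long-lived ABIs behave in practice):

    struct stats_v1 { long rx; long tx; };
    struct stats_v2 { long rx; long tx; long drops; }; /* new field only appended */

    /* Code built against v1 still reads the common prefix of a v2 object
       correctly, because the shared leading fields keep their offsets. */
    long total(const struct stats_v1 *s) { return s->rx + s->tx; }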

  • @NaokiWatanabe
    @NaokiWatanabe 4 months ago

    L3 cache latency on an EPYC CPU is in the tens of nanoseconds range (19 to 31ns).

  • @BenUK1
    @BenUK1 3 months ago

    Very interesting. It also made me think about the fact that there is a conflict between how we humans would logically group variables together for readability and how the data should be organised in memory for optimal cache utilisation, and the fact that there's no programming language that I'm aware of that explicitly deals with these 2 conflicting requirements.
    Maybe we need a language where you can explicitly separate these 2 requirements, for example:
    struct ColouredTriangle
    {
        unsigned int  X1     : GroupA;
        unsigned int  Y1     : GroupA;
        unsigned char Red1   : GroupB;
        unsigned char Green1 : GroupB;
        unsigned char Blue1  : GroupB;
        unsigned int  X2     : GroupA;
        unsigned int  Y2     : GroupA;
        unsigned char Red2   : GroupB;
        unsigned char Green2 : GroupB;
        unsigned char Blue2  : GroupB;
        unsigned int  X3     : GroupA;
        unsigned int  Y3     : GroupA;
        unsigned char Red3   : GroupB;
        unsigned char Green3 : GroupB;
        unsigned char Blue3  : GroupB;
    }
    Here we have the source laid out in a logical order for humans to read, but the "Group" suffix tells the compiler to group them in memory differently from how they are laid out in the source.
    That was just a noddy example of one way to do it, but I'm sure there would be better ways to separate these 2 concerns in an actual language.

  • @maxwellsmart3156
    @maxwellsmart3156 4 months ago

    The benefits of runtime analysis with a profiler. You can also improve performance by moving code blocks around. Most of the time it isn't about how fast the code is but how fast one can produce the code. There have been many hardware architecture and performance improvements since 1991, and most of the time code just lies dormant and gets reused.

  • @notgauravmehta
    @notgauravmehta 4 months ago +4

    Cool video. I don't think it would make sense for the compiler to reorder the members of a struct. In some cases, I've cast an entire struct to a byte stream and deserialized it somewhere else. If the compiler reordered the members, my deserialization logic would fail on the other side, which would be unexpected behavior.

    • @PaulSpades
      @PaulSpades 3 months ago

      Completely true. A struct needs to be binary predictable, which is why this works.

  • @alessi4249
    @alessi4249 4 months ago

    Also goes to show how optimised it was that this level of optimisation has that much of an effect

  • @lalitkale
    @lalitkale 4 months ago +2

    To be precise: it's a networking subsystem optimization.

  • @AlejandroRodriguez-wt2mk
    @AlejandroRodriguez-wt2mk 4 months ago +1

    The order of things actually matters!!

  • @jotec1290
    @jotec1290 3 months ago +1

    The better question would be: how many megawatt-hours of energy did Google save with this patch?

  • @Zelidar
    @Zelidar 4 months ago

    As we all know, the Linux kernel was first released in 1991 by Linus Torvalds. Since that time, Linus and thousands of other sharp-minded software engineers have relentlessly continued to develop and refine it. It is pleasing to see that such significant improvements can still be made 32 years later; it is also a bit surprising, I must say.

  • @xcoder1122
    @xcoder1122 4 months ago

    C compilers could, in theory, optimize structs themselves by re-ordering their members. The problem is that the struct layout is not stored in the compiled code, so if a binary and multiple libraries share the same struct type and are built with different compilers, or different versions of the same compiler, the order may differ and everything breaks. That's why compilers won't change the order. Compiled code also relies on the order given at compile time, so every time the order changes, all code that reads/writes the struct must be re-compiled. Optimizing the order of structs is something programmers must do by hand in C, and they can only do it if no external code relies on a given order (as is the case with the Linux kernel, where non-kernel code doesn't know anything about those in-kernel structures).
    In languages other than C, compilers may re-order the members of data structures themselves, but that is only possible if the compiled code stores the order the compiler used, and if all compiled code is written in such a way that it can adapt to any order chosen, and any future order change, without re-compilation. The downside of such a feature is that all structure access is slower and the code is bigger and more complicated. Keep in mind that C is simple, low-level and supposed to be fast.

  • @yashkhd1100
    @yashkhd1100 4 months ago +22

    I can't believe the kernel TCP/IP stack was written without padding in mind. Years ago, when I used to write game engines, the first thing you did with every data structure was align it with padding. There are tons of pragmas in all the major C/C++ compilers to support this. This perf boost is good... but it's like someone forgot to change the engine oil in a car, and when they finally change it years later, they feel good that the car runs smoothly.

    • @kanishkgoel2673
      @kanishkgoel2673 4 months ago +9

      This one didn't have a lot to do with padding; it's more about spatial locality: things that are used together should be kept close together.

    • @rogo7330
      @rogo7330 4 months ago +2

      The compiler already pads variables for you: all longs will be at addresses aligned to long's size, all ints aligned to the size of int, every struct x aligned to the alignment of struct x, etc. This has nothing to do with padding; it's that the struct is much bigger than the CPU can fetch at once, so it goes to memory more often. Misalignment would mean a long straddles two words, so the CPU must fetch both, take the last half of the first and the first half of the second, and glue them together. That's not the problem here.

  • @AndersonPEM
    @AndersonPEM 3 months ago

    Reorder a struct
    GETS 40 PERCENT IMPROVEMENT
    Now that's a level of optimization I can only dream of achieving :)

  • @straightup7up
    @straightup7up 4 months ago

    Significant performance improvements with simple realignment.
    I wonder if changing tab stops from 4 to 2 will have similar effects.

  • @FredPhillips32169
    @FredPhillips32169 4 months ago

    IW10 was a huge improvement and it took forever to catch on.

  • @NatesRandomVideo
    @NatesRandomVideo 4 months ago

    Great discussion. Now they need to add the same optimizations for the other architectures! Grin…

  • @973sandman
    @973sandman 4 months ago

    If zero packets are received you have zero cache misses; the compiler can't know at compile time. Kudos to Google's people for identifying the variables to be grouped. I'm wondering if there's some runtime profiler for Linux to help with that.
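
    There are tools for exactly this; a sketch of two common ones (assuming a kernel with debug info and perf installed):

    # print a struct's layout, padding holes and cache-line boundaries
    pahole -C tcp_sock vmlinux

    # profile which cache lines are missed or contended at runtime
    perf c2c record ./workload
    perf c2c report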

  • @shadyabhi
    @shadyabhi 4 months ago

    What resources are you currently using to learn more about cache lines and data-driven design?

  • @jonathancrowder3424
    @jonathancrowder3424 4 months ago +1

    Really got to hit that line if you want the speed

  • @tenthlegionstudios1343
    @tenthlegionstudios1343 4 months ago

    Awesome video. For more on this topic of memory alignment and cache lines, I highly suggest Andrew Kelley's talk from the Handmade Cities conference. This is also known, loosely, as data-oriented design. The gains he got in the Zig compiler with similar optimizations were insane, and he has real examples from the code. Thanks for the vid, Hussein!

  • @astronemir
    @astronemir 4 months ago +1

    This is fucking amazing actually

  • @jozsab1
    @jozsab1 4 months ago +1

    256 Mb cache ... back in the days had 56 MB RAM ( yes, my dream was to add another 8MB ) under DOS :D

    • @phillipsusi1791
      @phillipsusi1791 4 months ago +1

      That's a really weird configuration. 3 sticks of ram, 32 + 16 + 8?

  • @ehudgavron9086
    @ehudgavron9086 4 months ago

    Hussein, thank you for the in-depth analysis. Also h/t to Michael@Phoronix because he's on top of everything.
    Is this good - yes in terms of results.
    No in terms of layer violations.
    So, in simple terms (for the readers) if we choose to improve the end result (a good thing) but to do so we must violate the layers of the protocol, the software, the system, and the CPU, our underlying assumptions that work well today become accepted and may not work further down the road.
    Simply put: TCP only requires a usable IP underlayer. IP only requires a usable physical (PHY) layer. The physical layer only requires a usable OS and Drivers (Winderz, Linux, BSD, MacOS, etc.). That final part is the hardware, which in this case is unchanged but future versions may have future hardware that doesn't play as nicely with those same OSs and drivers.
    When you violate the layering, you make assumptions about lower or upper layers that are likely true and useful TODAY for YOUR setup. They may not apply tomorrow or for everyone's setup. It's unreasonable to expect future users to benchmark whether the changes are good or bad for their setups... and so we avoid layer violations.
    As an experimenter -- good job Google.
    As a consultant -- good job Hussein and good job Michael Larabel @ Phoronix.
    As someone who likes to leave a job done for good and never be called back... not so much.
    Ehud
    Tucson
    Arizona
    US

  • @nghiaminh7704
    @nghiaminh7704 4 months ago +2

    game dev calls this data-oriented design

  • @houdaifabouamine
    @houdaifabouamine 4 months ago +2

    Thank you,
    I hope you share more content like this; a paid series is also a good idea.

  • @Murph9000
    @Murph9000 3 months ago

    Compilers can't reorder traditional C structs, as the struct is essentially a recipe for encoding and decoding a binary blob of data. This is relied on for things like accessing protocol headers, where the data order is externally defined. Additionally, the struct will be defined in one place (usually a header file), but may be used in many source files; the compiler doesn't know all of the places it is used when compiling individual source files.
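
    A toy example of that "recipe" idea (hypothetical header; the kernel's real ones, such as struct iphdr, work the same way):

    #include <stdint.h>
    #include <string.h>

    /* Field order matches the wire format, so the struct doubles as the codec. */
    struct my_hdr {
        uint8_t  version;
        uint8_t  flags;
        uint16_t payload_len; /* big-endian on the wire; convert after copying */
    } __attribute__((packed));

    void decode(const uint8_t *buf, struct my_hdr *h)
    {
        memcpy(h, buf, sizeof *h); /* reordering the fields would misparse every packet */
    }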

  • @jeffcauhape6880
    @jeffcauhape6880 3 months ago

    It's funny that no one thought of this before.

  • @ahmdelemam
    @ahmdelemam 4 months ago

    Beautiful as always, Hussein ❤

  • @konfushon
    @konfushon 2 months ago +1

    Not to be a party pooper, but the order of variables in smart contracts on Ethereum matters a lot, like a lot! This is so that we save gas. So seeing this tweaking of variable order in Linux honestly makes me happy.
    Hey, it seems like engineering concepts are the same no matter where you work

  • @shadyabhi
    @shadyabhi 4 months ago

    How do I identify these data reorganization possibilities? Any resources?

  • @embeddedbastler6406
    @embeddedbastler6406 4 months ago +3

    One has to wonder what performance improvements might be possible by implementing a data-oriented design, i.e., switching from the traditional Array-of-Structs (AoS) to the more cache-friendly Struct-of-Arrays (SoA) memory layout.
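
    A minimal C sketch of the two layouts (illustrative):

    #define N 1024

    /* AoS: one particle's fields sit together in memory. */
    struct particle { float x, y, z, mass; };
    struct particle aos[N];

    /* SoA: each field is its own dense array. */
    struct particles { float x[N], y[N], z[N], mass[N]; } soa;

    /* Summing only the masses touches every cache line of aos[],
       but just the mass[] lines of soa. */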

    • @PR-cj8pd
      @PR-cj8pd 4 months ago

      Is this theoretical or does anything like that already exist?

    • @embeddedbastler6406
      @embeddedbastler6406 4 months ago +2

      @@PR-cj8pd As an example, data oriented design is becoming more common in modern game engines in the form of Entity Component Systems (ECS). For compute intensive workloads such as physics simulations, this approach promises a considerable speedup.

    • @smlgd
      @smlgd 4 months ago +2

      @@PR-cj8pd computer graphics stuff use this a lot, from renderers to game engines, often the whole software architecture is based off that concept

  • @olafbuitelaar
    @olafbuitelaar 4 months ago

    The Intel bench seems to have a 200 Gb/s NIC vs. only 100 on the AMD?

  • @Devasia.Thomas
    @Devasia.Thomas 4 months ago

    It'd be nice to know how this affects older or current CPUs with less cache. Datacenters churn their infra every few years, so it definitely helps them for sure.

    • @edwardmacnab354
      @edwardmacnab354 4 months ago

      just use a different kernel

    • @Devasia.Thomas
      @Devasia.Thomas 4 months ago

      @@edwardmacnab354 Thats always an option if you have a problem with it. I was just curious on how it affects older stuff :)

  • @oleg9ff
    @oleg9ff 4 months ago +1

    0.000629 on Intel vs 0.000173 on AMD. In my opinion, switching to AMD is the best kernel patch 🎉

  • @rollinkendal8130
    @rollinkendal8130 4 months ago

    awesome!

  • @Felipe-bi3mk
    @Felipe-bi3mk 3 months ago

    Hey Hussein, would you make a video about what happens when we have an index on column X and run a query with "table.X IN (1, 2, 3, ..., 300)", where there's a big list of parameters inside the IN clause? I know it uses the index, but it seems very slow

  • @anandnaik9802
    @anandnaik9802 4 months ago

    Can you please tell us how it was discovered that such an optimization could be made? Are there performance-monitoring libraries for this specific purpose? Also, since this is an optimization problem, what is the best approach to finding the optimum? I.e., was this solution arrived at by human knowledge plus some data analysis, or was it solved by some reinforcement-learning approach?

  • @Skungalunga
    @Skungalunga 4 months ago +2

    I'd like to know the relative TCP/IP performance of Intel and AMD before the optimization was implemented. Was AMD faster and the gap grew, was performance the same and AMD gained an advantage, or was Intel faster in the first place? Long ago, Intel used a cache-inclusive approach while AMD's cache levels were exclusive and so had higher latency. I'm not sure if that is still the case, but it might have some bearing on how to optimize the code.

  • @lhxperimental
    @lhxperimental 4 months ago

    The compiler can't decide on its own to change the order of struct members. Structs are often serialized as-is, so the order of bytes on the wire would change if the members were reordered.

  • @saidbakr
    @saidbakr several months ago

    So, if my Linux kernel is 6.8, would the ping response time to my router be shorter by 40%? Or does my router need to use 6.8 too?

  • @hackaholic01
    @hackaholic01 4 months ago

    I know a bit about spatial locality (when an address is accessed, nearby addresses are likely to be accessed soon). It wasn't obvious that the struct should hold related variables adjacent to each other. How did they miss this in the first place?

  • @manw3bttcks
    @manw3bttcks 25 days ago

    It's an up to 40% reduction in CPU overhead, not necessarily "my PC now sends 40% more bytes per second"

  • @netman87
    @netman87 4 months ago

    Does this matter with virbr between virtual servers?

  • @hadekhae.f.5847
    @hadekhae.f.5847 4 months ago

    I live at the mercy of these engineers, hope they stay fit and sharp 🥹

  • @kellymoses8566
    @kellymoses8566 4 months ago +1

    performance regression is so easy to cause.

  • @cyberhard
    @cyberhard 4 months ago +1

    When the source is open, the possibilities are endless.

  • @RustedCroaker
    @RustedCroaker 3 months ago

    The better question is:
    How did relocating variables in a structure to please the cache make Intel 3.31% slower?

  • @Ray14324
    @Ray14324 4 months ago

    When you add something to a struct, its usage is generally more limited, as it's usually part of a new feature or mechanism. At that point you don't know how often that part will be used, and over the course of years its use may change. So the person who added the line can't possibly take cache misses into account on each structure change, especially in big code bases. Expecting a developer to account for that is unjustified. Every couple of years it's worth measuring cache hits as a separate optimization task. Those things usually come later in the game.

  • @korgmangeek
    @korgmangeek 4 months ago

    Should we call it defragmentation of memory? Or maybe defragmentation of cache memory?

  • @CallousCoder
    @CallousCoder 4 months ago +7

    My question with complex, pervasive low-level subsystems like these is always: "how do you test them properly?" And how do they affect (in this case) other CPUs? But it's a very impressive gain, and I know the Linux Foundation won't merge something this crucial without proper testing and analysis.

    • @edwardmacnab354
      @edwardmacnab354 4 months ago

      Changing around variable locations in a structure would affect a helluva lot of other code, requiring a lot of analysis and testing. You never know, it might not even be a good idea in the end.

    • @astronemir
      @astronemir 4 months ago +1

      @@edwardmacnab354 At the end of the day, if it's a big gain for some who contribute to the kernel and a wash for others, I hope they merge it.

    • @Krmpfpks
      @Krmpfpks 4 months ago +4

      @@edwardmacnab354 Changing around variables in internal structs will not change anything, as all code using them is recompiled and uses the new layout. It just changes how the CPU caches the values. Of course, they do not change structs that are part of the API, which would break userspace.

    • @CallousCoder
      @CallousCoder 4 months ago

      @@Krmpfpks I agree. When I watched the video I was like: "oh, is this all they've done?" This is pretty safe. In the worst-case scenario, on a CPU with smaller caches, it could perform the same, as an extra memory fetch is needed.

    • @CallousCoder
      @CallousCoder 4 months ago

      @@edwardmacnab354 Naah, when I actually watched the video I was like: "oh, a variable position within a struct, that's pretty safe". Now, that's not to say it performs 30% faster on every CPU, but in the worst case it just involves a memory fetch it may already be doing. So nothing gained and nothing lost.

  • @darkwinter7395
    @darkwinter7395 4 months ago

    This is why profiling is SO important.

  • @alivepenmods
    @alivepenmods 4 months ago +6

    Hey Hussein,
    I noticed you called yourself a "noob", a "normal programmer", and various other self-deprecating names.
    Please don't. Low-level programmers aren't smarter than backend engineers. They just work on different systems, where the complexity is tied to legacy codebases, ISAs and hardware, while our systems' complexity lies in external dependencies and distributed systems.
    Unions and knowledge of memory alignment are standard knowledge for C programmers, but they don't make anyone a god, and they're no more difficult than handling data consistency and availability over a network.

  • @srivarshan7925
    @srivarshan7925 4 months ago

    I'm thinking: couldn't we implement this kind of optimization in the compiler itself? It's an algorithmic problem to order the elements of a struct to reduce padding.

  • @offBeatRock777
    @offBeatRock777 4 months ago

    Can this be backported to 5.x please 😅

  • @NixUgriBugri
    @NixUgriBugri 3 months ago

    It may be worth mentioning that Google has standardized on customized AMD-CPU-based servers for a while (I may be mistaken

  • @darknetworld
    @darknetworld 3 months ago

    I wonder if AI would try to reduce lag?

  • @za_ozero
    @za_ozero 4 months ago

    My only question is: can it invalidate networking drivers and such?

  • @nexpro6985
    @nexpro6985 4 months ago

    The fact that this performance improvement was implemented is less remarkable than the fact that it has taken so long for it to be discovered.

  • @mementomori8856
    @mementomori8856 4 months ago +1

    Turned my notifications on

  • @pablocoyote9647
    @pablocoyote9647 4 months ago

    No Intel IPv4, or did I miss it?

  • @nicholasfigueiredo3171
    @nicholasfigueiredo3171 4 months ago +1

    The compiler can't organize your struct for efficiency, since you only know how it's accessed at runtime. You need to do it manually after running the code at least once. It's also not a good idea for the compiler to re-organize your struct, since some optimizations rely on it being laid out exactly as declared; you need a specific reason to change it, so you do it manually afterwards in cases like this.

    • @davidmartensson273
      @davidmartensson273 4 months ago +2

      It's also about memory layout: in C a struct is contiguous memory, and you can, for example, cast one struct into another and read the same data in another way. The compiler cannot know whether any user of the struct does anything like that, and therefore cannot change the order without risking breaking changes.