Implementing Understandable World Class Hash Tables in C++ - Eduardo Madrid, Scott Bruce CppCon 2022

CppCon

มุมมอง 13 465

เพิ่มลงใน
- เพลย์ลิสต์ของฉัน
- ดูภายหลัง
แชร์

แชร์

ฝัง

ขนาดวิดีโอ:

แสดงแผงควบคุมโปรแกรมเล่น

เล่นอัตโนมัติ

เล่นใหม่

เผยแพร่เมื่อ 25 ส.ค. 2024
cppcon.org/
---
A Success Story in Implementing Understandable World Class Hash Tables in Cpp - Eduardo Madrid and Scott Bruce - CppCon 2022
github.com/Cpp...
We present a success story about implementing one of the most important data structures, hash tables, with world class performance.
C++ allowed maintaining the Robin Hood hash table invariant (abbreviated RH) with the performance of parallel computation, even using only normal integer operations:
1. The execution costs of maintaining this invariant were minimized
2. The benefits are realized beyond other implementations
3. The actual code is easier to understand than alternatives in the performance S-tier
RH is a powerful invariant that allows implementations to unlock non-obvious and substantial benefits including cache locality.
Our solution for this challenge is to use parallel computation techniques that require only normal integer processor operations, and can be scaled up to "SIMD" operations (SIMD means "Single Instruction Multiple Data").
The parallel programming abstraction layer is composable, programmer understandable, and friendly to the optimizing compiler. We will prove these assertions with real source code, benchmarks and Compiler Explorer disassembly.
This will illustrate how specifics of RH and the scalable SIMD-based implementation gives not-seen-before improvements: using the bits devoted to "Probe Sequence Length" not just for cache locality but also reducing false matches exponentially.
These examples gives us confidence that fully realizing the powers of this invariant will outperform alternative schemes.
---
Eduardo Madrid
Tech Lead at Snap, Inc., helping performance on augmented reality of the Snapchat app. Author of open source libraries and components such as the Zoo type-erasure framework. Prior experience at Fintech, including automated trading. Several times presenter at CPPCon, C++ Now, C++ London
Scott Bruce
Scott has been writing software professionally for 20+ years.
He's been entranced by performance and efficiency at least 15 of those 20+ years.
He worked in distributed systems, Search, Adwords, Adsense and ML at Google for 13 years. He worked 4.5 years at Snapcht, on monetization systems, performance, and advanced mobile telemetry systems. He is currently an engineer at Rockset, working on distributed system performance and real time analytics. Presenter of a production software talk at Google, Snap, UCLA, USC, Caltech and others over the last ten plus years.
---
Videos Filmed & Edited by Bash Films: www.BashFilms.com
TH-cam Channel Managed by Digital Medium Ltd events.digital...
#cppcon #programming #hashtable

ความคิดเห็น • 21

@thomasokeeffe5050 ปีที่แล้ว
This is a great talk to watch right after the Swiss table talks
@JohnDlugosz ปีที่แล้ว ⁺⁴
You don't have to ban _all_ division.
Obviously you can use it in a constexpr.
You can also do division by a value known at compile time: this will be optimized into a multiplication and a couple of shifts.
@GreenJalapenjo ปีที่แล้ว ⁺⁵
I mean, it seems pretty clear from context that they ban division *instructions*, not division operators in the code.
@surters 3 หลายเดือนก่อน
@@GreenJalapenjo How can you ban an division instruction, code inspection? having looked a lot on the GCC flags I don't remember any that directly forbid any specific instruction.
@joachimjoyaux179 ปีที่แล้ว
Great talk. Many thanks !
@MisterDan ปีที่แล้ว
Very nicely done.
@stephenl7797 ปีที่แล้ว ⁺³
Is there an open-source repo by these authors? I cannot find it.
@yangwei6761 ปีที่แล้ว ⁺³
undered_map 58ms 57ms 59ms
robin_hood 7ms 7ms 7ms
wtf? phenomenal if true.
@JohnDlugosz ปีที่แล้ว ⁺⁷
We already know that the regular unordered_map is very slow. Even something like flat_map is faster, because of memory locality. You need quite a few entries before hashing would be faster, but then you find the "bucket" implementation is slower at that size!
@noahgoldstein3951 ปีที่แล้ว ⁺⁷
Was there a GitHub link?
@zangetsu1310 ปีที่แล้ว ⁺¹
It would be in Eduardo's "zoo" repo, and may be there soon, but the repo home page says to contact them for access, as they are working on it privately.
@zangetsu1310 ปีที่แล้ว
Now it says they're shooting for an initial implementation this week!
@eduardomadrid2380 ปีที่แล้ว ⁺²
This was an oversight that we did not include the link to the repository.
See my other comment.
@eduardomadrid2380 ปีที่แล้ว ⁺²
@@zangetsu1310 Not an initial implementation, but a an initial public release, meaning something for people to use as opposed to the code we used to research what we presented.
@jamesjaya6744 10 หลายเดือนก่อน
@@eduardomadrid2380 I am sorry if this is a dumb question, but why SWAR over SIMD? why not just implemented the technique presented using SIMD? Seems like we can get even more parrarelization using SIMD. Is it just because of portability, or SWAR is faster because SIMD might have more overhead for bit parrarelization?
@JohnDlugosz ปีที่แล้ว
42:39 if this is going where I expect -- checking 8 bytes starting at the position of the home slot -- then you're relying on the permissive nature of x86 to read mis-alinged words from memory and do so efficiently.
@NXTangl ปีที่แล้ว
My understanding is that all loads are aligned, but bit shifts are used to do the comparisons unaligned. So just under 2x the number of aligned loads on average, plus some bit fiddling that's basically free, and you also potentially eliminate metadata mirroring.
@nirajandata ปีที่แล้ว
Yoo

ต่อไป

เล่นอัตโนมัติ

C++ Algorithmic Complexity, Data Locality, Parallelism, Compiler Optimizations, & Some Concurrency