How Fast can Python Parse 1 Billion Rows of Data?

  • Published May 15, 2024
  • To try everything Brilliant has to offer, free for a full 30 days, visit brilliant.org/DougMercer.
    You’ll also get 20% off an annual premium subscription.
    -------------------------------
    Sign up for 1-on-1 coaching at dougmercer.dev
    -------------------------------
    The 1 billion row challenge is a fun challenge exploring how quickly we can parse a large text file and compute some summary statistics. The coding community created some amazingly clever solutions.
    In this video, I walk through some of the top strategies for writing highly performant code in Python. I start with the simplest possible approach, and work my way through JIT compilation, multiprocessing, and memory mapping. By the end, I have a pure Python implementation that is only one order of magnitude slower than the highly optimized Java challenge winner.
    On top of that, I show two much simpler but just as performant solutions that use the polars dataframe library and DuckDB (an in-process SQL database). In practice, you should use these, because they are incredibly fast and easy to use.
    If you want to take a stab at speeding things up further, you can find the code here github.com/dougmercer-yt/1brc.
    References
    ------------------
    Main challenge - github.com/gunnarmorling/1brc
    Ifnesi - github.com/ifnesi/1brc/tree/main
    Booty - github.com/booty/ruby-1-billion/
    Danny van Kooten C solution blog post - www.dannyvankooten.com/blog/2...
    Awesome duckdb blog post - rmoff.net/2024/01/03/1%EF%B8%...
    pypy vs Cpython duel blog post - jszafran.dev/posts/how-pypy-i...
    Chapters
    ----------------
    0:00 Intro
    1:09 Let's start simple
    2:55 Let's make it fast
    10:48 Third party libraries
    13:17 But what about Java or C?
    14:17 Sponsor
    16:04 Outro
    Music
    ----------
    "4" by HOME, released under CC BY 3.0 DEED, home96.bandcamp.com/album/res...
    Go buy their music!
    Disclosure
    -----------------
    This video was sponsored by Brilliant.
    #python #datascience #pypy #polars #duckdb #1brc

Comments • 356

  • @dougmercer
    @dougmercer  หลายเดือนก่อน +9

    To try everything Brilliant has to offer, free for a full 30 days, visit brilliant.org/DougMercer.
    You’ll also get 20% off an annual premium subscription.

  • @eddie_dane
    @eddie_dane หลายเดือนก่อน +296

    Are mustaches the new hoodies for programmers now?

    • @dougmercer
      @dougmercer  หลายเดือนก่อน +27

      I grew mine at start of COVID ironically and never got rid of it ¯\_(ツ)_/¯

    • @raniwishahy1904
      @raniwishahy1904 หลายเดือนก่อน +28

      Prime mentioned

    • @dougmercer
      @dougmercer  หลายเดือนก่อน +14

      @@raniwishahy1904 blazingly fast!

    • @ApexFunplayer
      @ApexFunplayer 26 วันที่ผ่านมา +1

      Thighhighs bruh.

    • @knut-olaihelgesen3608
      @knut-olaihelgesen3608 20 วันที่ผ่านมา

      Why not annotate the return type of your functions? Those are the most important ones. Because it returns a sequence, annotating the elements of that sequence in the return type would be golden for the rest of the program

  • @danieljakob1307
    @danieljakob1307 หลายเดือนก่อน +197

    The Summoning Salt homage at 8:26 is brilliant. Fantastic video!

    • @dougmercer
      @dougmercer  หลายเดือนก่อน +13

      Thanks =] I had way too much fun with that, haha

    • @ric8248
      @ric8248 หลายเดือนก่อน +7

      It would have been great to play the musical theme right there!

    • @dougmercer
      @dougmercer  หลายเดือนก่อน +16

      Summoning Salt does use the track I played there sometimes ("4" by HOME).
      Love his music choices =]

  • @joker345172
    @joker345172 28 วันที่ผ่านมา +54

    8:24 Amazing trick! It reminds me of computer graphics class where we had to find a way to improve the DDA Line algorithm... No one could do it. Then, the professor showed us the Bresenham algorithm. It's such a simple concept - instead of working with floats, work with integers! - but it saves soooo much time. It goes to show that sometimes the data type you're working with can have a huge effect on how fast your code is.
    Drawing a parallel to Machine Learning, this is also why new GPUs have FP8 and FP16 as big selling points. Training with FP32, which is still the standard for a lot of applications, is just dog slow compared to using FP16 or even FP8.
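
    For reference, a minimal sketch of Bresenham's integer-only line algorithm (a standard textbook version, not anything from the video), showing how the float math gets replaced by an integer error term:

    def bresenham(x0, y0, x1, y1):
        """Yield the integer pixel coordinates on the line from (x0, y0) to (x1, y1)."""
        dx, dy = abs(x1 - x0), -abs(y1 - y0)
        sx = 1 if x0 < x1 else -1
        sy = 1 if y0 < y1 else -1
        err = dx + dy                      # running error term, all integer arithmetic
        while True:
            yield x0, y0
            if x0 == x1 and y0 == y1:
                return
            e2 = 2 * err
            if e2 >= dy:
                err += dy
                x0 += sx
            if e2 <= dx:
                err += dx
                y0 += sy

    print(list(bresenham(0, 0, 5, 2)))     # [(0, 0), (1, 0), (2, 1), (3, 1), (4, 2), (5, 2)]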

    • @dougmercer
      @dougmercer  28 วันที่ผ่านมา +3

      Very true! (Also, super cool algorithm -- I never worked with computer graphics so I just read up on bresenham's algorithm)

    • @Deltax64
      @Deltax64 23 วันที่ผ่านมา +1

      Half true - the main benefit of FP8/FP16 is reduced memory footprint, not so much the fact that individual operations are faster.

    • @mnxs
      @mnxs 17 วันที่ผ่านมา +1

      @@Deltax64 should individual instructions not be slightly faster on smaller data? I don't actually know how floating point ops are implemented in hardware; I've only learnt a bit about integer arithmetic hardware, and in those cases I'd think bigger data sizes would mean slightly slower performance, since certain ops need partial results to cascade or indeed need multiple micro-ops. However, mostly it depends on the complexity of the arithmetic circuitry.
      But yes, the smaller data size is likely the biggest winner. So much of GPU processing is memory bound, plus you can fit larger data sets into memory and get better cache performance when you have smaller data units. (I'm wondering if the practical implementation of ops with these smaller data types really just converts to f32, computes, then converts back down. It would simplify things, if nothing else.)

    • @Deltax64
      @Deltax64 17 วันที่ผ่านมา +2

      @@mnxs
      > @Deltax64 should individual instructions not be slightly faster on smaller data? I don't actually know how floating point ops are implemented in hardware,
      Yes, they *can* be faster, just depends on the chip. But like you say, sometimes they're implemented internally by widening and converting back -- so may be the exact same!

  • @guinea_horn
    @guinea_horn หลายเดือนก่อน +451

    C *can't* be slower than Java, can it? The slowest C implementation would be to implement the entire JVM and then write bad Java code

    • @dougmercer
      @dougmercer  หลายเดือนก่อน +113

      From some comments on Reddit, they speculate the Java implementation performed better than C because Java has a JIT. www.reddit.com/r/Python/comments/1c4ln3x/comment/kzshq27/
      Alternatively, since the challenge started in the Java community, more people worked together to find more optimizations.

    • @alfiegordon9013
      @alfiegordon9013 29 วันที่ผ่านมา +48

      You would be SHOCKED how much slower linked libraries make a lot of code.
      It's why LuaJIT FFI calls into C can be as much as 25% faster than native C: it doesn't have to do the linking.

    • @stefanalecu9532
      @stefanalecu9532 29 วันที่ผ่านมา +59

      ​@@dougmercerto my knowledge, Java hasn't broken the 1s barrier, while the fastest C solution is 0.5s, so C isn't losing its job any time soon

    • @dougmercer
      @dougmercer  29 วันที่ผ่านมา +27

      Who got down to 0.5 seconds in C?

    • @mayur9876
      @mayur9876 29 วันที่ผ่านมา +77

      JIT has a runtime cost. No way Java beats C in terms of code execution. To me this sounds like a C skill issue 😅

  • @BosonCollider
    @BosonCollider หลายเดือนก่อน +89

    The actual lessons from this are:
    1: use duckdb
    2: otherwise, use polars
    3: use pypy more, and push back against libraries that are incompatible with it

    • @dougmercer
      @dougmercer  หลายเดือนก่อน +13

      Yup, absolutely

    • @seantparsons
      @seantparsons 29 วันที่ผ่านมา +6

      The lesson I took from this is that you should probably just write it in Java in the first place.

    • @squishy-tomato
      @squishy-tomato 29 วันที่ผ่านมา +9

      introducing sql makes it not worth using duckdb over polars imo, unless you absolutely need those 2s

  • @mathmaniac43
    @mathmaniac43 24 วันที่ผ่านมา +11

    What did you not like about the index variables in booty's original code? I find named variable indexes more readable than "magic numbers". I would have probably used an enum with incrementing values instead.

    • @dougmercer
      @dougmercer  24 วันที่ผ่านมา +7

      You're right. I've since changed my mind.
      When refactoring, I got a bit fast and loose with timing and making multiple changes at once. I thought that removing them helped performance, but I was mistaken. They definitely help maintainability and should have been kept
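
      For illustration, assuming the per-station stats live in a fixed-layout list the way booty's code keeps them (the field order here is hypothetical), the named indexes look something like:

      MIN_TEMP, MAX_TEMP, SUM_TEMP, COUNT = range(4)   # readable index constants instead of magic numbers

      stats = {}                                       # station name -> [min, max, sum, count]

      def update(station, temp):
          entry = stats.setdefault(station, [temp, temp, 0.0, 0])
          entry[MIN_TEMP] = min(entry[MIN_TEMP], temp)
          entry[MAX_TEMP] = max(entry[MAX_TEMP], temp)
          entry[SUM_TEMP] += temp
          entry[COUNT] += 1

      update("Hamburg", 12.3)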

  • @skanderghamgui5039
    @skanderghamgui5039 29 วันที่ผ่านมา +70

    I had a project last year where I had to automate a manual process using Python to extract data from an Excel file and auto-fill an XML file. After I finished the project, I reduced the process from 3 months of human work to a 20-minute code run, which made me and my boss very happy. I wish I had seen this video last year; we could have been even happier. Nevertheless, it's great to know that I can achieve such high levels of Python performance. I will ensure better time management for my future projects.
    Thanks.

    • @dougmercer
      @dougmercer  29 วันที่ผ่านมา +11

      3 months to 20 minutes is a great speed up!
      How frequently do you need to extract the data? One of my favorite XKCD comics is "Is it worth the time?" xkcd.com/1205/
      Odds are, 20 minutes is good enough =]

    • @skanderghamgui5039
      @skanderghamgui5039 29 วันที่ผ่านมา +11

      ​@@dougmercer the company I worked for needed that data often almost on every project they accept so yeah I saved them a ton of time! That was my end of study project during a 6 month internship which I used to succeed with high honors from the university.

  • @smol.bird42
    @smol.bird42 หลายเดือนก่อน +30

    your editing has so much taste, great video bro

    • @dougmercer
      @dougmercer  หลายเดือนก่อน +1

      That is such a nice compliment! Thanks =]

  • @FirroLP
    @FirroLP 27 วันที่ผ่านมา +15

    Dude, your production quality is so good it's criminal. Had to tell you

    • @dougmercer
      @dougmercer  27 วันที่ผ่านมา

      Thanks man, that's such a nice compliment. I really appreciate it =]

  • @shadamethyst1258
    @shadamethyst1258 29 วันที่ผ่านมา +70

    I'm impressed you did not do any profiling, nor any statistical test to rule out measurement fluctuations

    • @dougmercer
      @dougmercer  29 วันที่ผ่านมา +28

      There definitely are a good deal of fluctuations --- it's largely why I used language like "10ish seconds" and waited to see reasonably large deltas in performance before declaring an improvement. Things definitely get tricky to measure at this speed!

    • @f1shyv1shy35
      @f1shyv1shy35 27 วันที่ผ่านมา

      Why not use something like hyperfine?

    • @micmaxian
      @micmaxian 24 วันที่ผ่านมา +3

      I think this would have blown up the scope of the video and also made it harder for non stats people to understand. I liked the fun ish measurements! Really good video, definitely subscribing and looking forward to more fun and informative content in the future Doug!

  • @otty4000
    @otty4000 หลายเดือนก่อน +15

    wow this was a really great video. Its impressive to explain code/libraries differences that quickly and clearly.

    • @dougmercer
      @dougmercer  หลายเดือนก่อน

      Thanks =]

  • @fatcats7727
    @fatcats7727 23 วันที่ผ่านมา +3

    Just wanted to say, all of your videos are incredibly clean and well edited, and although the algorithm isn't picking it up rn, your efforts will not go unnoticed!

    • @dougmercer
      @dougmercer  23 วันที่ผ่านมา

      Thanks so much =]

  • @artlenski8115
    @artlenski8115 29 วันที่ผ่านมา +7

    Highly optimised C with proper compiler flags taking almost double the time of the Java implementation, even if GC is turned off... hard to believe.

    • @garette8672
      @garette8672 28 วันที่ผ่านมา +1

      how could it possibly be hard to believe that more people happened to try to optimize the java implementation.. not a crazy concept and surely plausible.

  • @50shmekles
    @50shmekles 19 วันที่ผ่านมา +2

    This is one of the most well-done, detailed and thorough yet clear, concise and to the point videos ever. Thank you for introducing me to new concepts and libraries!

    • @dougmercer
      @dougmercer  19 วันที่ผ่านมา

      Thanks! Glad it was helpful!

  • @MakeDataUseful
    @MakeDataUseful หลายเดือนก่อน +5

    Great video, thanks for taking the time to create 🤙

    • @dougmercer
      @dougmercer  หลายเดือนก่อน +1

      Thanks! =]

  • @gharren
    @gharren 29 วันที่ผ่านมา +89

    As we all know, Python is the fastest programming language there is. By the time your program has done its job, the C++ developer is still busy fixing segfaults.

    • @Danielm103
      @Danielm103 28 วันที่ผ่านมา +8

      Yeah, no, not really. I write in both languages. “How fast is Python at...” is not really a question for me, because I drop down into C/C++ and write an optimized module.

    • @michal4561
      @michal4561 28 วันที่ผ่านมา +3

      @@Danielm103 so you can write the performance-critical parts in C++ while keeping development fast for the general case?

    • @Danielm103
      @Danielm103 27 วันที่ผ่านมา

      @@michal4561 This is what makes Python amazing. If you follow the paradigm “premature optimization is the root of all evil.” You can happily code along in Python until something becomes a problem performance wise, then look for an optimized module, I.e. similar to how numpy does all the heavy number crunching in C. I do a lot of heavy computations in my work, so I write the stuff that needs to be fast in C++ and call it from python

    • @phitc4242
      @phitc4242 26 วันที่ผ่านมา

      ​@michal4561 he's built different. a gigachad so to say. he does the shit you guys using regular python don't want to do - writing real optimized code. check what language your favorite python modules are written in - most of the time in C/C++. and python is just a wrapper for those two. Without those things written in C/C++ (or even assembly) python would never in a billion lifespans of the universe be as fast as it is today.
      we have to be honest here and accept that fact. and be thankful for a moment.
      Also, I have nothing against a fast python, I just want to make sure we all have a reality check here. And I love C.

    • @hevad
      @hevad 22 วันที่ผ่านมา +4

      Yes, but once you put it in production, your Python implementation continues to drag on every job it runs.
      Also, writing the optimized Python implementation seems to take just as long as a reasonable C++ implementation, if not longer.

  • @rgndn_bhat
    @rgndn_bhat 19 วันที่ผ่านมา +2

    Nice one, Doug. My Cpython implementation finished in 64 seconds on M2 MacBook air, almost the same approach - memory mapped, multi processing and chunks

    • @dougmercer
      @dougmercer  19 วันที่ผ่านมา

      That's pretty good! So close to sub 1 minute mark
      Is it possible to release the GIL and do multithreading? That would probably save time.
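
      For anyone unfamiliar, the memory-mapping piece is just the standard library's mmap module; a minimal sketch, assuming the 1brc-style measurements.txt input:

      import mmap

      with open("measurements.txt", "rb") as f:
          # Map the file read-only; slicing mm[...] touches pages on demand
          # instead of copying ~13 GB into a Python bytes object up front.
          with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
              end = mm.find(b"\n")
              first_line = mm[:end]        # e.g. b"Hamburg;12.3"
              print(first_line)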

  • @richardrubin2192
    @richardrubin2192 หลายเดือนก่อน +3

    This is great - thanks, Doug!

    • @dougmercer
      @dougmercer  หลายเดือนก่อน

      Thanks for watching! =]

  • @nullzeon
    @nullzeon 29 วันที่ผ่านมา +4

    how am I just finding out about this channel, editing, knowledge, this video was fantastic!

    • @dougmercer
      @dougmercer  29 วันที่ผ่านมา +1

      Thanks! Glad you enjoyed it =]

  • @thahrimdon
    @thahrimdon 18 วันที่ผ่านมา +1

    This is amazing! I was in it with you for the long haul. Had me smiling and frowning the whole way! Great video!

    • @dougmercer
      @dougmercer  17 วันที่ผ่านมา

      Hahaha awesome =] thanks!

  • @tzacks_
    @tzacks_ หลายเดือนก่อน +16

    in other words, getting performance out of python means rewriting the code in C or using a library written in C :)

    • @dougmercer
      @dougmercer  หลายเดือนก่อน +4

      PyPy is written in RPython, which targets C. A lot of compilers target C ¯\_(ツ)_/¯.

  • @mutatedllama
    @mutatedllama 29 วันที่ผ่านมา +1

    Amazing video, thanks for posting. Learning about polars and duckdb gave me a real-world takeaway that I could bring to my job. Liked, subscribed and saved!

    • @dougmercer
      @dougmercer  29 วันที่ผ่านมา +1

      Awesome! Glad to hear =]

  • @jamesborden7105
    @jamesborden7105 23 วันที่ผ่านมา +3

    Practically speaking, I prefer the polars implementation over the duckdb because I'd rather chain function calls instead of manipulating text when doing data analysis in Python. But maybe a library like pypika would solve this?

  • @danklynn
    @danklynn 25 วันที่ผ่านมา

    Excellent editing and presentation. Thanks!

    • @dougmercer
      @dougmercer  25 วันที่ผ่านมา

      Thanks =]

  • @weaselontheclock6695
    @weaselontheclock6695 หลายเดือนก่อน +1

    Nice video, keep it up. Would love to have seen more language comparisons

    • @dougmercer
      @dougmercer  หลายเดือนก่อน +1

      Good point. A few people have asked about Rust and Go... Will try to do next time!

    • @weaselontheclock6695
      @weaselontheclock6695 หลายเดือนก่อน +1

      Looking forward to it! Was my first time watching I'm already subscribed :), fantastic quality man

  • @vitorsilva-or1dj
    @vitorsilva-or1dj หลายเดือนก่อน +2

    thanks for the video!

    • @dougmercer
      @dougmercer  หลายเดือนก่อน

      Thanks for watching and commenting =]

  • @KieranG
    @KieranG 21 วันที่ผ่านมา +1

    Hey man, this is great content and I’m surprised it hasn’t been pushed to my feed earlier. Keep it up
    Also 8k subs and a Brilliant sponsorship? Cool shit lolol

    • @dougmercer
      @dougmercer  21 วันที่ผ่านมา

      Thanks! =] And yeah, I was thankful -- I got two different sponsors around 4k subscribers and turned down a few others. I'll take it as a sign that I'm doing something right ¯\_(ツ)_/¯

  • @aquacruisedb
    @aquacruisedb 29 วันที่ผ่านมา +3

    I have no idea what any of this means, and I thought a python was a snake and rust a problem.. BUT, strangely it was entertaining to watch, and very satisfying to see the run times come down!

    • @dougmercer
      @dougmercer  29 วันที่ผ่านมา +1

      Hah! Great comment. Thanks for watching =]

    • @aquacruisedb
      @aquacruisedb 28 วันที่ผ่านมา +1

      @@dougmercer It's a testament to your presentation skills that a non-programmer made it to the end tbh. I'm just scratching my head as to why youtube put it in my feed, but I'm not complaining!

    • @dougmercer
      @dougmercer  28 วันที่ผ่านมา +1

      @@aquacruisedb the universe is sending you signs to learn to program! Or buy a snake... Or check your car's undercarriage for rust...

  • @Dan_Diaconescu
    @Dan_Diaconescu 25 วันที่ผ่านมา

    Amazing video my dude, keep it up!

    • @dougmercer
      @dougmercer  24 วันที่ผ่านมา

      Thanks! Will do =]

  • @anon_y_mousse
    @anon_y_mousse 26 วันที่ผ่านมา +2

    Personally, I consider writing fast code to be a matter of experience. If you know the correct methodologies for doing things, then writing a fast solution should be second nature. Take for instance Danny's naive implementation in C, which in the linked article, he states that it took 8 minutes. His justification for writing it that way is that C doesn't have a native hash table implementation, but if you use C and aren't implementing it yourself or have previously implemented it yourself, then you should at the very least know where to get an adequate third-party library. This is also why anyone who's newly getting into programming should only use C if they want to be a good programmer because you'll have to learn how to do so much on your own until you learn what libraries you should use or have your own. Since my computer has lower specs than Danny's, I'm going to test my own library and see how it compares.

  • @Websitedr
    @Websitedr 29 วันที่ผ่านมา +2

    I'm honestly more impressed by that duckdb implementation I might actually try that on something. 1 billion lines sub 10 seconds nobody should be complaining about it being 'too slow'

    • @dougmercer
      @dougmercer  29 วันที่ผ่านมา

      Yeah, I'm definitely going to use DuckDB more often after this. Seems incredibly powerful for data that's big enough to be a pain, but not big enough to need to be distributed across multiple systems
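
      A sketch of what the DuckDB version boils down to (parameter names can differ slightly across DuckDB releases; this assumes the 1brc-style station;temp file):

      import duckdb

      result = duckdb.sql("""
          SELECT station,
                 MIN(temp) AS min_temp,
                 AVG(temp) AS mean_temp,
                 MAX(temp) AS max_temp
          FROM read_csv('measurements.txt', header=false, delim=';',
                        columns={'station': 'VARCHAR', 'temp': 'DOUBLE'})
          GROUP BY station
          ORDER BY station
      """)
      print(result)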

  • @AntonioZL
    @AntonioZL 29 วันที่ผ่านมา +3

    My main takeaway from this video is that Python is much faster than I thought, and I say this as a Python back-end developer. 9 minutes with the most trivial implementation against 3-ish from Java? I'll take that. I definitely expected 20+ minutes lol

    • @dougmercer
      @dougmercer  29 วันที่ผ่านมา

      I was shocked when the PyPy + pure python approach broke the 10 second mark...

  • @tsgraphics1
    @tsgraphics1 หลายเดือนก่อน +8

    Can you make a video comparing the performance of Mojo?

    • @dougmercer
      @dougmercer  หลายเดือนก่อน +3

      I plan to some day, but am waiting on a v1 release.

  • @randypittman279
    @randypittman279 หลายเดือนก่อน +3

    Does file I/O chunking not really matter for the pure python implementations? That is, is there no gain in reading large chunks of the file into RAM rather than reading line-by-line? Rightly or wrongly (premature optimization) I always have a voice at the back of my head telling me to minimize I/O operations. Especially if the data is cold and on spinning platters!
    Super cool video. Switching to bytes and doing your own int parsing were new ideas to me!

    • @dougmercer
      @dougmercer  หลายเดือนก่อน +2

      It might be possible to speed it up more with chunking! I didn't try because I couldn't really wrap my head around a good way of doing it.
      If you want to give it a shot, try forking this repo! github.com/dougmercer-yt/1brc
      (if you don't feel like generating 13GB of data, you're welcome to send me a gist or link to a fork and I'll try running it).
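
      For anyone who wants to try, the usual chunking trick is to read big fixed-size blocks and carry the trailing partial line over into the next block; a rough sketch:

      def iter_lines_chunked(path, chunk_size=1 << 20):
          """Yield complete lines while reading the file in 1 MiB blocks."""
          leftover = b""
          with open(path, "rb") as f:
              while True:
                  block = f.read(chunk_size)
                  if not block:
                      break
                  block = leftover + block
                  last_newline = block.rfind(b"\n")
                  if last_newline == -1:
                      leftover = block                  # no complete line in this block yet
                      continue
                  yield from block[:last_newline].split(b"\n")
                  leftover = block[last_newline + 1:]   # partial last line, saved for next block
          if leftover:
              yield leftover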

  • @Sugar3Glider
    @Sugar3Glider 18 วันที่ผ่านมา

    Convert to Lat/Long, z becomes temperature, translate locations into chosen format and youre gooden. Just need to set the display parameters.

  • @user-pg9nf2vq8s
    @user-pg9nf2vq8s 10 วันที่ผ่านมา +1

    i would never use python, but i like watching how people optimize the hell out of something.

    • @dougmercer
      @dougmercer  9 วันที่ผ่านมา

      There's something Zen about it 🧘

  • @dearheart2
    @dearheart2 26 วันที่ผ่านมา

    Was interesting. It reminds me of back at the university. I was engineering all kinds of algorithms. At that time there was no Python.

  • @Finnnicus
    @Finnnicus หลายเดือนก่อน +1

    great production value doug! you'll get many more views if you keep it up

    • @dougmercer
      @dougmercer  หลายเดือนก่อน

      Thanks! I hope so 🤞

  • @TiagoBonetti
    @TiagoBonetti 27 วันที่ผ่านมา +2

    The SummoningSalt reference was fire!

    • @dougmercer
      @dougmercer  27 วันที่ผ่านมา

      Thanks =]

    • @pwhiteOO
      @pwhiteOO 19 วันที่ผ่านมา

      lol I thought I was gonna be the only one to spot that.

  • @GLOCKSURU
    @GLOCKSURU หลายเดือนก่อน

    What is the font/theme you use in the images of code? It is so nice.

    • @dougmercer
      @dougmercer  หลายเดือนก่อน

      I use Anonymous Pro font (fonts.google.com/specimen/Anonymous+Pro) and nord-base16 colors when syntax highlighting with pygments (github.com/idleberg/base16-pygments). Nord style is pretty close to nord-base16 though and is more common.
      (One minor caveat about the colors: the mapping between tokens and colors is out of date for that repo, so I fixed the colors for nord-base16 on a personal fork).
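
      For anyone wanting to reproduce the look, basic Pygments usage is roughly this (the built-in "nord" style is shown as a stand-in; the exact colors in the videos come from the patched nord-base16 fork mentioned above):

      from pygments import highlight
      from pygments.lexers import PythonLexer
      from pygments.formatters import HtmlFormatter

      code = 'print("hello, 1brc")'
      # Renders the snippet to a standalone HTML page with the chosen style.
      html = highlight(code, PythonLexer(), HtmlFormatter(style="nord", full=True))
      print(html)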

  • @6IGNITION9
    @6IGNITION9 27 วันที่ผ่านมา +1

    Great video. How do you animate the code?

    • @dougmercer
      @dougmercer  27 วันที่ผ่านมา +1

      My current code animation process is a bit of a pain. I made a custom Pygments formatter to create a file that I can copy/paste into my video editor (Davinci Resolve) that makes all the text+ objects be colored appropriately, and then I manually move things around or fade in/fade out.
      In the past I've used manim. That also was kind of a pain.
      I just started working on a new approach, but it's gonna be awhile before I even know if it's a good idea or not

  • @frd85
    @frd85 หลายเดือนก่อน +1

    informative video with nice summoning salt vibes. good job.

    • @dougmercer
      @dougmercer  หลายเดือนก่อน

      Thanks =] (and sorry if summoning salt music is stuck in your head now)

  • @V1ewSh0t
    @V1ewSh0t 25 วันที่ผ่านมา +1

    @dougmercer I have an idea, what if you use the GPU instead of just the CPU? the GPU is historically faster when running repeating computations (As far as I know) I could be completely wrong about this and if I am, please tell me. But I feel as this could be worth a try! (Great video btw!)

    • @dougmercer
      @dougmercer  25 วันที่ผ่านมา +1

      It's a good idea! I saw this submission that uses cuDF + Dask to get 4.5 seconds on their machine github.com/gunnarmorling/1brc/discussions/487

  • @TheFwip
    @TheFwip 24 วันที่ผ่านมา +1

    Pedantic nit: at 8:00, you say "casting it as an integer instead of a float."
    This should be "parsing," as casting is (usually) used to refer to things that have no runtime cost - e.g. telling the compiler "now pretend these four bytes are an int32."
    Otherwise, very good video. Curious also which Java runtime you used.

    • @dougmercer
      @dougmercer  24 วันที่ผ่านมา

      I used openjdk 21.0.2 because I wanted to brew install it, but the actual challenge winner used 21.0.2 graal

    • @TheFwip
      @TheFwip 24 วันที่ผ่านมา

      @@dougmercer thanks!

  • @marlan__
    @marlan__ 25 วันที่ผ่านมา

    Is polars multi processed? Is that something it does automatically or could we see the same improvements by running that multiprocessed too?

    • @dougmercer
      @dougmercer  25 วันที่ผ่านมา +1

      I believe it is multithreaded in rust, which saturates all the cores. So, I wouldn't expect multiprocessing it in Python would help

  • @sharjeel_mazhar
    @sharjeel_mazhar 29 วันที่ผ่านมา +1

    Great video sir! 🔥 I've a video request for you. Can you please make a video about coding time critical parts in let's say c++ and then call it from python to save time. There could be many use cases, where we want to do something and python takes forever and the same task can fly through using c++. I hope you understand what I'm tryna say?
    Putting simply: Extending Python with C++ or any other language for that matter let's say Java

    • @dougmercer
      @dougmercer  29 วันที่ผ่านมา

      I don't have a video entirely dedicated to that, but I do have one titled "Compiled Python is FAST" which includes discussions of Cython, which can let you include plain C or C++ very easily.
      There are other options for making C extension libraries tho
      Hope that helps!

  • @Wurstfinger-rl1zi
    @Wurstfinger-rl1zi 23 วันที่ผ่านมา +2

    this shit is actual python wizardry

  • @anneallison6402
    @anneallison6402 15 วันที่ผ่านมา +1

    How do you do the code animations?

    • @dougmercer
      @dougmercer  15 วันที่ผ่านมา

      I've used two different approaches for animating code.
      1. In my early videos I used the `manim` library. The community edition has a Code object.
      2. In recent videos, I created a custom Pygments formatter that outputs the syntax highlighted code as a Davinci Resolve Fusion composition.
      Both approaches have a lot of problems.
      I'm currently writing my own animation library. I may make a video about it soon (but I would probably not be open sourcing the code)
      Another option you may find useful is reveal.js. That lets you write code animations in JavaScript, and even has an "auto-animate" feature that works OK. However, since that's more for live presentations, you would need to screen record if you wanted to make a video

  • @andersondantas2010
    @andersondantas2010 26 วันที่ผ่านมา +1

    [14.5s using rust]
    Hi , I did the challenge myself and that was my best time on a M1 with 8GB of RAM. To be honest I used some external dependencies but still enjoyed the challenge haha (first time coding rust). If you don't mind I'd like to discuss some items from your solutions:
    1. Have you tested parsing the numbers byte-per-byte?
    2. How can your code account for numbers under the 10 degree mark, as they have fewer digits than your parser expects?
    3. Have you tried tweaking the chunk size to closer to the cache size? I had my best results reading chunks of 188kb
    As I have less memory than the whole file size, mmap didn't give me the great performance other people had, so I stayed with manual file handling

    • @dougmercer
      @dougmercer  26 วันที่ผ่านมา +1

      That seems like a pretty great time! Both my laptop and the official challenge workstation had 64GB of RAM, so I expect that your approach would be even faster on those systems.
      1. I did not try parsing byte by byte . Do you have a gist that I could look at to get a sense of how you did it in rust?
      2. Numbers in the file can either be -##.#, -#.#, #.#, or ##.#. Even if the temperature is ~0 degrees, it'll be 0.2 instead of just, say, .2, so these four cases are exhaustive. We first check if there is a minus sign. If there is, we effectively shift forward one character. Then, we check where the period is. If the period is the character after the current character, then we know that the number after the potential minus sign is of the form #.#. Otherwise, we know it is of the form ##.#. (That branching is sketched in code below.)
      3. I did not try to mess with chunk size. Another community member submitted a solution to the GitHub that was interesting . Its almost as fast as the doug_booty4 approach and does not use mmap. It had a chunk size parameter and that did affect performance. (Whereas doug_booty4 gets down to like 9.7s on my system, his got to about 10.1). I'm not sure if using a different chunk size for the doug_booty approach would help. It may!
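
      The branching described in point 2 looks roughly like this (a sketch of the idea, not the repo's exact code; it returns the temperature as integer tenths, which is part of the "work with ints" trick):

      def parse_temp(buf, i):
          """Parse a temperature like b'-12.3' starting at offset i; return (tenths, next_index)."""
          sign = 1
          if buf[i] == 45:                # ord('-')
              sign = -1
              i += 1
          if buf[i + 1] == 46:            # ord('.')  ->  form #.#
              value = (buf[i] - 48) * 10 + (buf[i + 2] - 48)
              i += 4                      # skip "#.#\n"
          else:                           #            ->  form ##.#
              value = (buf[i] - 48) * 100 + (buf[i + 1] - 48) * 10 + (buf[i + 3] - 48)
              i += 5                      # skip "##.#\n"
          return sign * value, i

      assert parse_temp(b"-12.3\n", 0) == (-123, 6)
      assert parse_temp(b"0.2\n", 0) == (2, 4)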

    • @andersondantas2010
      @andersondantas2010 26 วันที่ผ่านมา

      @@dougmercer although the chained ifs/elifs might look unoptimized, the compiler ends up converting those to jump tables, so the processing time is constant

    • @dougmercer
      @dougmercer  26 วันที่ผ่านมา

      @@andersondantas2010 ah, did you reply with a second comment containing a link? TH-cam might have caught it in a filter, but I don't see anything in my "held for review" comments. if so, maybe just comment back your GitHub username and I'll try to find the gist/GitHub on there ¯\_(ツ)_/¯

    • @dougmercer
      @dougmercer  26 วันที่ผ่านมา +1

      @@andersondantas2010 I did try this approach. It was almost as fast, but the approach I listed in the video tends to be slightly faster. github.com/dougmercer-yt/1brc/blob/main/src%2Fdoug_booty4_alternate.py#L8-L18

  • @joseduarte9823
    @joseduarte9823 6 วันที่ผ่านมา

    Depending on how large the total sum actually is, using an incremental mean may yield better performance since python won’t need to upgrade the number to a big int

    • @dougmercer
      @dougmercer  6 วันที่ผ่านมา

      Neat idea... It's worth a shot! Feel free to fork the repo and give it a try
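
      For reference, the incremental-mean update the comment describes is just this (a sketch; whether it actually beats a plain running sum under PyPy would need measuring):

      # Running mean: mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n.
      # The accumulator stays near the data's magnitude instead of growing into a huge sum.
      count = 0
      mean = 0.0
      for x in (25.3, -7.1, 14.0):
          count += 1
          mean += (x - mean) / count
      print(round(mean, 4))    # 10.7333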

  • @wanfuse
    @wanfuse 26 วันที่ผ่านมา +1

    Try Cython and serializing the code perhaps? I've seen this sort of thing make a big difference. Also try profiling the code. Also, it's 13GB; if you don't want to bother with chunking, then read it into memory ahead of time. If nothing else it tells you whether you're I/O bound or not

    • @dougmercer
      @dougmercer  26 วันที่ผ่านมา

      I would def be interested in seeing a Cython version! I do think it's possible to beat this implementation if you can do multithreading instead of multiprocessing... I don't have time to implement it but you're welcome to try!

  • @Almondz_
    @Almondz_ 22 วันที่ผ่านมา

    What application do you use for the code block display?

    • @dougmercer
      @dougmercer  22 วันที่ผ่านมา

      Hah, so... It's a bit complicated.
      My current approach for animating code is to use a custom Pygments formatter to create a Davinci Resolve Fusion setting file that I can copy/paste into my video editor, then edit it in Davinci Resolve.
      This approach has a lot of flaws (very hard to find which text+ node has the token I want, very slow to render).
      In my old videos , I animated code using the python library `manim`. This also had a lot of flaws (inconsistent behavior, difficult to preview what I'm doing, difficult to deal with things at token level).
      I'm currently working on making my own text animation library similar to manim, but more tailored to what I need for my videos. I've made good progress, but it's still a WIP.
      There are other off the shelf options that might work for you depending on what you're trying to accomplish (e.g., reveal.js)

    • @Almondz_
      @Almondz_ 21 วันที่ผ่านมา +1

      @@dougmercer Oh, that's really cool! Do you have a way I can contact you?

    • @dougmercer
      @dougmercer  21 วันที่ผ่านมา

      @Almondz_ sure, check my channel's "about" section for my email

  • @seansingh4421
    @seansingh4421 28 วันที่ผ่านมา

    ChemE here, not a programmer; would an LLM inference server be faster and use comparatively fewer resources if it were implemented in C++ rather than Python?

    • @dougmercer
      @dougmercer  28 วันที่ผ่านมา +1

      Hmm. There's a lot of moving parts to the question.
      Generally a server-side ML workflow would be accelerated by GPUs (Nvidia graphics cards) or some other purpose-built chips (e.g., tensor processing units, TPUs).
      Code is structured to do as much processing on these purpose-built chips as possible, since they are faster and more energy efficient. In the case of Nvidia GPUs, machine learning libraries like PyTorch effectively marshal the data to the GPU and then execute CUDA code (Nvidia's framework for doing computation on the GPU). Once there, Python or C is somewhat out of the loop, or at the very least not a significant bottleneck.

  • @darrenzou2225
    @darrenzou2225 26 วันที่ผ่านมา +2

    The fastest is of course multi-universe read, which can read all 1 billion rows simultaneously and do it in constant time

    • @dougmercer
      @dougmercer  26 วันที่ผ่านมา

      At least until causality is deprecated. Then we can get the answer before running the code!

  • @mrjson3039
    @mrjson3039 หลายเดือนก่อน

    First time channel watcher here. Amazing video, thanks for this superb piece of content Mister *checks notes* "Python Jack Black"

    • @dougmercer
      @dougmercer  หลายเดือนก่อน

      HAHAHAHA oh man. I guess I'll take it

  • @willymcnamara1429
    @willymcnamara1429 17 วันที่ผ่านมา +1

    interesting video! thank you Doug 🤝 🐍

    • @dougmercer
      @dougmercer  17 วันที่ผ่านมา

      Thanks!

  • @ericcartmansh
    @ericcartmansh หลายเดือนก่อน +2

    Shocked to see the final java result

    • @dougmercer
      @dougmercer  หลายเดือนก่อน +1

      Me too! Apparently someone's Golang solution got down to 1.1 seconds github.com/dhartunian/1brcgo

  • @nikilragav
    @nikilragav 7 วันที่ผ่านมา

    are you allowed to use numpy or gpu (torch, cupy, etc)

    • @dougmercer
      @dougmercer  7 วันที่ผ่านมา

      You could use numpy, but I don't think it would help (the bottleneck is reading the data in).
      I did see some use Dask + cuDF (CUDA) and that was very fast. However, it wasn't allowed in the challenge because the evaluation system didn't have a GPU

    • @nikilragav
      @nikilragav 7 วันที่ผ่านมา

      @@dougmercer ah. Reminds me of another challenge I saw where IO is the bottleneck. Even there I'm wondering if writing the content to the GPU memory and back is too slow

  • @KaranBulani
    @KaranBulani 27 วันที่ผ่านมา

    Multithreading and multiprocessing are not supported in Python, correct? Due to the global interpreter lock. How did he do it at 3:56?

    • @dougmercer
      @dougmercer  27 วันที่ผ่านมา +1

      Python fully supports multiprocessing. You just basically have to pay the overhead of serializing/deserializing data between the parent and child processes.
      Multi*threading* does not work well because of GIL
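
      The overall multiprocessing pattern, heavily simplified (the real solutions split on line boundaries like this but parse and merge per-station stats instead of just counting lines):

      import multiprocessing as mp
      import os

      FILE = "measurements.txt"    # assumed 1brc input file

      def process_range(bounds):
          start, end = bounds
          count = 0
          with open(FILE, "rb") as f:
              f.seek(start)
              if start:
                  f.readline()             # skip the partial line at the chunk boundary
              while f.tell() <= end:
                  line = f.readline()
                  if not line:
                      break
                  count += 1               # the real code parses and aggregates here
          return count                     # return values are pickled back to the parent

      if __name__ == "__main__":
          size = os.path.getsize(FILE)
          n = mp.cpu_count()
          ranges = [(size * i // n, size * (i + 1) // n) for i in range(n)]
          with mp.Pool(n) as pool:
              print(sum(pool.map(process_range, ranges)), "lines")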

  • @why_tf_you_do_tis7941
    @why_tf_you_do_tis7941 หลายเดือนก่อน +2

    Honestly great video. I recently saw ThePrimeagen's stream on the Go implementation of this and thought to myself, how fast can it be done in Python? Just because I was bored I tried doing it on my own, and one night + 3 restarts (I read the entire file into memory and my swap setup is sh~t so it bricked my computer, oops~) + pandas, xarray, numpy and dask implementations later, I gave up, cause I don't have that long of an attention span. But after this video and seeing more approaches to this problem, I might try again (definitely not pure Python tho)

    • @dougmercer
      @dougmercer  หลายเดือนก่อน +1

      Prime got me interested in it too!
      It's definitely a fun problem. Without PyPy, Python would really struggle.
      At some point, I'd like to try a language I'm not familiar with and see how far I can get.
      Thanks for watching and commenting!

  • @DareDevilPhil
    @DareDevilPhil 23 วันที่ผ่านมา +1

    I feel like this should be a single core challenge for purity. I'm still watching though, see if I change my mind by the end.

    • @dougmercer
      @dougmercer  23 วันที่ผ่านมา

      So, did you change your mind by the end?

    • @DareDevilPhil
      @DareDevilPhil 23 วันที่ผ่านมา +1

      @@dougmercer can't say I did :)

    • @dougmercer
      @dougmercer  23 วันที่ผ่านมา

      @@DareDevilPhil hahaha fair enough =]

  • @CottidaeSEA
    @CottidaeSEA 5 วันที่ผ่านมา +1

    I shit on Python a lot for being slow, but honestly, 8-10 seconds to read 1 billion rows is sufficient in most scenarios.

  • @andydataguy
    @andydataguy หลายเดือนก่อน +1

    Great video!

    • @dougmercer
      @dougmercer  หลายเดือนก่อน

      Thanks Andy! Much appreciated =]

  • @poutineausyropderable7108
    @poutineausyropderable7108 24 วันที่ผ่านมา

    Tbh, the usage of global variables clearly defined as constants is for code readability.
    Replacing constants with magic numbers is not worth the performance boost, especially when, in a strongly typed language, the compiler will optimise them away.

    • @dougmercer
      @dougmercer  24 วันที่ผ่านมา +1

      Yeah, in hindsight I agree.
      It doesn't seem to make any difference in performance-- I was mistaken. A community member submitted what i'd call a "well engineered" version of the code that had a proper CLI, a few debugging options, and reintroduced the globals. It was almost as fast as the fastest version I had (but still a fraction of a second slower cause he didn't use mmap)

  • @BrianStDenis-pj1tq
    @BrianStDenis-pj1tq 24 วันที่ผ่านมา +1

    Great video. I'd like to know how Java is faster than C.

    • @dougmercer
      @dougmercer  24 วันที่ผ่านมา

      Thanks!
      I'm not 100% sure (I'm admittedly not qualified to speculate on Java/C performance optimization). The thoughts I've seen are:
      1. Many people focused on Java (since original challenge language), fewer focused on C. So the Java implementation is super well optimized and the C could have left some potential improvements missing
      2. The JVM JIT helped out
      ¯\_(ツ)_/¯

  • @themanwhobeateinstein
    @themanwhobeateinstein หลายเดือนก่อน +1

    Which Java exactly was it, I need to know so I can use it

    • @dougmercer
      @dougmercer  หลายเดือนก่อน

      github.com/gunnarmorling/1brc?tab=readme-ov-file#results check out the top result. JDK 21.0.2-graal

    • @themanwhobeateinstein
      @themanwhobeateinstein หลายเดือนก่อน +1

      @@dougmercer Thanks 👍

  • @overloader7900
    @overloader7900 19 วันที่ผ่านมา +1

    back in c i use characters in single quotes instead of ascii numbers, like '-' instead of 45, thats more obvious

    • @dougmercer
      @dougmercer  19 วันที่ผ่านมา

      Yeah. I tried to do something like that, but b"abc"[0] returns a number whereas b"abc"[:1] returns b"a".
      I could have used ord(b"a") but I was trying to inline as much as possible to be safe ¯\_(ツ)_/¯
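
      Concretely, the behavior being described:

      data = b"-12.3"
      print(data[0])        # 45   (indexing bytes gives an int: the ASCII code of '-')
      print(data[:1])       # b'-' (slicing gives a length-1 bytes object)
      print(ord(b"-"))      # 45   (so data[0] == ord(b"-") is the readable spelling)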

  • @wlockuz4467
    @wlockuz4467 29 วันที่ผ่านมา +6

    When I see videos like this, I feel like I know nothing about programming. I have been a software engineer for over 3 years now.

    • @dougmercer
      @dougmercer  29 วันที่ผ่านมา +3

      It's never too late to learn new stuff!
      Play with a new library or start a project that's way different than your usual work
      I used to only know Excel, visual basic, and Matlab. Over time, I found excuses to experiment with Python, Linux, git, and docker and I became a much better developer because of them.
      Three years is still super early in your career. Continuous learning and intellectual curiosity is the most important skill a dev can have.

  • @__python__
    @__python__ 27 วันที่ผ่านมา

    Thanks @dougmercer for this video, but in the polars variation, the speed cannot be solely ascribed to the Python language, as you are likely aware of the underlying programming language employed by polars.

    • @dougmercer
      @dougmercer  27 วันที่ผ่านมา

      I do say that Polars is implemented in rust, and put it in the "Python-ish" section for that reason

  • @VorpalForceField
    @VorpalForceField 25 วันที่ผ่านมา

    Very Cool ..!! Thank You for sharing .. Cheers :)

    • @dougmercer
      @dougmercer  25 วันที่ผ่านมา

      Thanks for watching =]

  • @Bernarditete
    @Bernarditete 29 วันที่ผ่านมา +1

    I would love to see a Mojo implementation

    • @dougmercer
      @dougmercer  29 วันที่ผ่านมา +1

      I do plan to try Mojo in some future videos.
      I have two requirements before covering them: language is open sourced (recently done) and they have a stable v1 release (hopefully sometime soon)

  • @this-one
    @this-one 13 วันที่ผ่านมา

    Would it count as Python if we write it as a module in C?

    • @dougmercer
      @dougmercer  13 วันที่ผ่านมา +1

      I'm no philosopher, but this gives me ship of theseus vibes. so... maybe technically but I don't feel good about it

  • @oliviarojas7023
    @oliviarojas7023 หลายเดือนก่อน +1

    Yo Doug... the repo is only showing in your recent commits... not sure if that was intentional, but it took me an extra click to get there haha... about .05 extra seconds, and I think we can do better.

    • @dougmercer
      @dougmercer  หลายเดือนก่อน

      Oh hmm, I think I put it under my dougmercer-yt organization instead of my dougmercer user. Sorry for the confusion, but glad you found it =]
      Oh, and good luck! I'd love for someone to get this down to like 5 seconds.

    • @oliviarojas7023
      @oliviarojas7023 29 วันที่ผ่านมา +1

      Oh I can't beat that haha.... I was being stupid about the extra time it took to get to your repo.... I was just goofin though ;].... love your channel btw..... just found you and you are my new go to... low level is a great name for what I was looking for! Cheers

    • @oliviarojas7023
      @oliviarojas7023 29 วันที่ผ่านมา +1

      Oh shit I was thinking of another channel I recently ran into @lowlevellearning ... yall both got the chops though.... Doug mercer is a good name too hahaha sorry

    • @dougmercer
      @dougmercer  29 วันที่ผ่านมา

      LLL is great too =]

  • @Pritam252
    @Pritam252 13 วันที่ผ่านมา

    Does this video's measurements get affected by Disk speed?

    • @Pritam252
      @Pritam252 13 วันที่ผ่านมา

      (I guess every experiment will always have some slight noise)

    • @dougmercer
      @dougmercer  13 วันที่ผ่านมา

      Oh definitely. The fact that I have a lot of RAM and a solid state drive also makes this much faster than if I had very little RAM and only an HDD.
      I also had some background processes running, which add a bit of noise to the measurements
      That said, I think my system is roughly on par specs wise with the challenge's evaluation system

  • @BraxtonMeyer
    @BraxtonMeyer 28 วันที่ผ่านมา

    which font are you using.

    • @dougmercer
      @dougmercer  28 วันที่ผ่านมา

      Anonymous Pro

  • @cottawalla
    @cottawalla 8 วันที่ผ่านมา

    I couldn't get your opening "performance critical python" out of my head and so missed the entire rest of the video.

    • @dougmercer
      @dougmercer  8 วันที่ผ่านมา

      ¯\_(ツ)_/¯

  • @bryanbischof4351
    @bryanbischof4351 26 วันที่ผ่านมา

    The cultural impact of summoningsalt on nerds is unmatched

  • @DarkZeros
    @DarkZeros หลายเดือนก่อน +5

    Nobody has actually tried on the C side. Because I am sure it can beat java or at least, get same results

    • @dougmercer
      @dougmercer  หลายเดือนก่อน

      Give it a shot! Info for the C implementation is here www.dannyvankooten.com/blog/2024/1brc/

    • @DarkZeros
      @DarkZeros หลายเดือนก่อน +1

      @@dougmercer Thanks! I might do! (Looks like a small enough problem to give a try)

  • @irfanfauzi8704
    @irfanfauzi8704 หลายเดือนก่อน +1

    This is great. But did I miss numpy in your vids ?

    • @dougmercer
      @dougmercer  หลายเดือนก่อน +1

      It wouldn't help with this problem, because so much of the work is IO + dealing with scalars

    • @irfanfauzi8704
      @irfanfauzi8704 28 วันที่ผ่านมา

      Interesting. I should learn more. Thanks for replying

  • @gagers78
    @gagers78 24 วันที่ผ่านมา +1

    Using ctypes and optimized C code, you can write a shared library that runs the C code from Python, getting the speed of C whilst pretending it's still Python. That would be my attempt

    • @dougmercer
      @dougmercer  24 วันที่ผ่านมา

      I mean... technically ctypes is a built in library... ¯\_(ツ)_/¯
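
      A rough sketch of that ctypes route (libparse.so and its process_file function are hypothetical; you would write and compile the C part yourself):

      import ctypes

      # Hypothetical shared library built from hand-optimized C, e.g.:
      #   gcc -O3 -shared -fPIC parse.c -o libparse.so
      lib = ctypes.CDLL("./libparse.so")
      lib.process_file.argtypes = [ctypes.c_char_p]
      lib.process_file.restype = ctypes.c_int

      rc = lib.process_file(b"measurements.txt")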

  • @incremental_failure
    @incremental_failure 28 วันที่ผ่านมา

    Numba comparison would've been interesting, probably combined with numpy in the compiled function.

    • @dougmercer
      @dougmercer  27 วันที่ผ่านมา +1

      Hmmm, I'm not sure off the top of my head how I'd do it. I worry that file I/O would make it hard to only use valid Numba.
      That said, I am a big fan of Numba! I did another video (Compiled Python is FAST) and it showed how awesome Numba can be

  • @falklumo
    @falklumo 18 วันที่ผ่านมา

    Any close-to-optimum solution should basically measure the filesystem's read performance and nothing else. Which is 2.7s on a fast 10 GB/s SSD with 20 char station names on average. So, the winning 1.5s somehow already defeat physics ;) Which shows how meaningless these synthetic competitions really are. Except that Python remains slow what we knew beforehand. Of course, the file may have been allowed to be cached in DRAM which is likely to happen in a 64GB M1 Max Mac system. With its 40x memory bandwidth over SSD, file read can then be in the 0.1s range. Except that I don't think filesystems are this efficient even when cached.

  • @rushcoc9605
    @rushcoc9605 12 วันที่ผ่านมา +1

    😮😅 Can you tell me how you measure how much time each part of the code takes? Just by looking, or how?

    • @dougmercer
      @dougmercer  11 วันที่ผ่านมา +1

      1. A bit of intuition
      2. A bit of being totally wrong (removing the constants that indicated which column was min, max, count, sum didn't speed up performance... I was trying too many things at once and accidentally bundled that change in with something else)
      3. I used a lot of time.perf_counter() to measure the time that certain operations took and A/B tested them (roughly as sketched below)
      Normally, in CPython, I would typically profile using something like PyInstrument or other similar line-level profilers
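
      The perf_counter pattern from point 3, roughly:

      import time

      t0 = time.perf_counter()
      total = sum(range(10_000_000))       # stand-in for the operation being A/B tested
      elapsed = time.perf_counter() - t0
      print(f"{elapsed:.3f}s")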

  • @MartialBoniou
    @MartialBoniou 25 วันที่ผ่านมา

    I filter a 7GB of amazon books TSV data in 5 or 6 seconds in AWK (mawk or GNU awk; on an outdated macbook air M1). Otherwise, +1 for DuckDB (not sponsored)

    • @dougmercer
      @dougmercer  24 วันที่ผ่านมา +1

      I do think filtering is an easier task than aggregating. These folks seemed to have a hard time getting a particularly fast awk implementation github.com/gunnarmorling/1brc/discussions/171 . I am not an awk wizard though so I can't really assess how good their code is

  • @jz8741
    @jz8741 28 วันที่ผ่านมา

    Pardon my ignorance, but I thought because of the GIL python couldn't do multi threading. Can anyone tell me what I'm missing?

    • @dougmercer
      @dougmercer  28 วันที่ผ่านมา

      This used multiprocessing, not multi threading.
      Multiprocessing creates a separate python process for each worker and serializes/deserializes the data moved between the child processes and the main one

  • @vlc-cosplayer
    @vlc-cosplayer 13 วันที่ผ่านมา

    Me, approaching this as an engineer:
    - Read a random subset of the data
    - Do the computation on that
    - Yeah that's close enough lmao, interpolation will take care of missing values

    • @dougmercer
      @dougmercer  13 วันที่ผ่านมา

      Hah! Working smarter not harder 🚀

    • @vlc-cosplayer
      @vlc-cosplayer 13 วันที่ผ่านมา

      @@dougmercer I remembered a video I watched about HyperLogLog. When working with extremely large datasets, a fast approximation may be more desirable than getting the correct answer, but only after a long time. 👀
      It'd be interesting to measure how good an approximation you can get using only a fraction of the data. E.g., would using 10% of the data get 90% of the way to the correct answer? You probably don't need 100% accuracy all the time. In fact, your data may not even be 100% accurate to begin with!
      To put the cost of precision in perspective, getting 99% uptime is relatively easy (that's 80 hours of downtime/year), but every additional 9 after that becomes exponentially more expensive. 99.9% is 8 hours. 99.99% is only 1 hour, 99.999% is only 5 minutes, go to the bathroom and you'll miss that. 💀

  • @priteshsingh4055
    @priteshsingh4055 หลายเดือนก่อน +1

    amazing video

    • @dougmercer
      @dougmercer  หลายเดือนก่อน

      Thanks! =]

  • @sohamtilekar5126
    @sohamtilekar5126 26 วันที่ผ่านมา +1

    You can disable gc

    • @dougmercer
      @dougmercer  26 วันที่ผ่านมา +1

      Hmm, that's an interesting idea...

    • @dougmercer
      @dougmercer  26 วันที่ผ่านมา

      So, I tried it, and it is still pretty close in terms of performance.
      (1brc) ➜ 1brc git:(main) ✗ bash eval.sh | python calc_stats.py
      pypy3 src/doug_booty4.py: 9.903 seconds
      pypy3 src/doug_booty4_no_gc.py: 9.907 seconds
      (1brc) ➜ 1brc git:(main) ✗ bash eval.sh | python calc_stats.py
      pypy3 src/doug_booty4.py: 9.846 seconds
      pypy3 src/doug_booty4_no_gc.py: 9.897 seconds
      (1brc) ➜ 1brc git:(main) ✗ bash eval.sh | python calc_stats.py
      pypy3 src/doug_booty4_no_gc.py: 9.881 seconds
      pypy3 src/doug_booty4.py: 9.920 seconds
      Maybe slightly better?
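
      For reference, turning the collector off is just the stdlib gc module; the no_gc variant presumably adds something like this near the top of the script (a guess, not the file's actual contents):

      import gc

      gc.disable()     # stop automatic garbage collection; how much this buys depends on the interpreter (CPython vs PyPy)
      # ... do the work ...
      gc.enable()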

  • @mahdirostami7034
    @mahdirostami7034 12 วันที่ผ่านมา

    8:53 I cannot believe writing a parser would gain any performance considering the default one is probably implemented in C. I assume this is a case of pypy optimizing while running and I'm wondering if running this same script with cpython would result in worse performance.

    • @dougmercer
      @dougmercer  11 วันที่ผ่านมา +1

      I expect that the custom parser would have worse performance in plain CPython, but I didn't test it

  • @pietraderdetective8953
    @pietraderdetective8953 หลายเดือนก่อน +1

    Cython is my fave when I need to speed things up and use the C engine inside Python.
    Well-optimized Cython code is close to C speed.

    • @dougmercer
      @dougmercer  หลายเดือนก่อน +3

      Cython is great! I used it in my video "Compiled Python is Fast", but didn't try it in this one.
      Definitely feel free to give it a shot in the 1brc and let me know how well it performs!

    • @BosonCollider
      @BosonCollider หลายเดือนก่อน +1

      The main complaint I have about Cython is that it is... not very expressive. I'd rather just be writing Rust with PyO3

    • @dougmercer
      @dougmercer  หลายเดือนก่อน

      pyo3 is a great option. I'm basically farming "why not rust" comments for these performance optimization videos and will eventually do a video on it

  • @Markyroson
    @Markyroson 27 วันที่ผ่านมา

    I wonder how Pybind11 would do. I’ve personally seen orders of magnitude improvements with data processing vs pure python.

    • @dougmercer
      @dougmercer  27 วันที่ผ่านมา

      I do feel that using pybind11 would be considered Python-ish (not pure Python). But you're welcome to try if you'd like to fork the repo in the video description =]

  • @alexnolfi3730
    @alexnolfi3730 11 วันที่ผ่านมา +1

    did you test out pandas to see how much slower it was than polars?

    • @dougmercer
      @dougmercer  11 วันที่ผ่านมา +1

      It's way slower
      Using the pandas implementation in here github.com/Butch78/1BillionRowChallenge/blob/main/python_1brc%2Fmain.py takes about 150s, whereas the polars implementation takes 11-12s
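
      For comparison, a polars version is essentially a lazy scan plus a group_by (a sketch, not the repo's exact code; parameter names like separator/new_columns shift a bit between polars releases):

      import polars as pl

      df = (
          pl.scan_csv(
              "measurements.txt",
              separator=";",
              has_header=False,
              new_columns=["station", "temp"],
          )
          .group_by("station")
          .agg(
              pl.col("temp").min().alias("min"),
              pl.col("temp").mean().alias("mean"),
              pl.col("temp").max().alias("max"),
          )
          .sort("station")
          .collect(streaming=True)
      )
      print(df)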

  • @kiffeeify
    @kiffeeify 29 วันที่ผ่านมา +1

    The root cause is the CSV file. try doing this without parsing strings to floats e.g. with parquet or even uncompressed arrow arrays:D

    • @dougmercer
      @dougmercer  29 วันที่ผ่านมา

      I did at the end with duckdb and got about 5ish seconds. Definitely helped compared to 9, but still some work to do to achieve Java speeds

  • @ardenthebibliophile
    @ardenthebibliophile หลายเดือนก่อน +2

    Wouldve loved to see a pandas attempt just as a benchmark

    • @dougmercer
      @dougmercer  หลายเดือนก่อน

      It's bad... haha. You can try running the pandas version here, github.com/Butch78/1BillionRowChallenge/blob/main/python_1brc%2Fmain.py

    • @ardenthebibliophile
      @ardenthebibliophile หลายเดือนก่อน

      @@dougmercer not particularly interested in downloading and running it myself. Is the result posted somewhere? Hard to find from the git repo alone

    • @dougmercer
      @dougmercer  หลายเดือนก่อน

      @@ardenthebibliophile I will run it later after I settle in

    • @ardenthebibliophile
      @ardenthebibliophile หลายเดือนก่อน +1

      @@dougmercer I really appreciate it.
      Also, I am a recent viewer of the channel, saw you discuss on Reddit. I very much appreciate your editing style. Well done

    • @dougmercer
      @dougmercer  หลายเดือนก่อน +2

      @@ardenthebibliophile Thanks so much! I appreciate it =]
      On the topic of pandas-- I just ran three trials that took around 2:30s flat.
      Few caveats being:
      * I have a bunch of chrome windows open and am doing some other tasks (whereas with the full video I did 5 trials, take average of middle three, with no other user stuff running besides the terminal and background processes).
      * I didn't bother to format the output in the correct format (but that doesn't take more than a fraction of a second anyways)
      So, quite a big jump between 150s for pandas to ~11-12s for polars.
      Hope that helps! (and thanks again for the nice comments!)

  • @12346798Mann
    @12346798Mann 24 วันที่ผ่านมา

    I have a CSV file with 1.4GB and 11 million rows. The data comes from around 60,000 XML files, which are compressed with gz. My shitty Python script first unzips the files, then parses and combines them into a single CSV. It's slow and eats memory, but it's good enough because it's not live

  • @pouet4608
    @pouet4608 26 วันที่ผ่านมา

    Why use a language instead of chaining unix tools together?

    • @dougmercer
      @dougmercer  26 วันที่ผ่านมา

      No reason! I'd love to see a solution that is just chained unix tools. Here is a thread of people discussing awk and parallel, but I imagine there are other ways...github.com/gunnarmorling/1brc/discussions/171

  • @user-yp7eq7tm4k
    @user-yp7eq7tm4k 28 วันที่ผ่านมา

    Great video

    • @dougmercer
      @dougmercer  28 วันที่ผ่านมา

      Thanks!

  • @jackkraus6948
    @jackkraus6948 24 วันที่ผ่านมา

    Doing some data processing in python which takes 26 minutes to run so this may be beneficial for me lol (most of it is pymongo but still)

    • @dougmercer
      @dougmercer  24 วันที่ผ่านมา +1

      I've never done anything with pymongo (or mongodb in general), but good luck!

  • @SergioTejedor
    @SergioTejedor 29 วันที่ผ่านมา +1

    What about cuDF ?

    • @dougmercer
      @dougmercer  29 วันที่ผ่านมา +1

      Didn't try it but don't think it would perform very well. You are welcome to fork the repo in my description and give it a shot tho!

    • @dougmercer
      @dougmercer  27 วันที่ผ่านมา

      @SergioTejedor , so I was wrong and stumbled across this github.com/gunnarmorling/1brc/discussions/487 pretty impressive!