Making It FAST - 1 Billion Row Challenge in Go

แชร์
ฝัง
  • เผยแพร่เมื่อ 24 พ.ย. 2024

ความคิดเห็น • 245

  • @jaryd_yarid
    @jaryd_yarid 7 หลายเดือนก่อน +63

    Too complicated. Just use Python for loops.

    • @asdanjer
      @asdanjer 5 หลายเดือนก่อน +2

      I mean... It is faster than 14 seconds XD

    • @vncstudio
      @vncstudio 2 หลายเดือนก่อน

      th-cam.com/video/utTaPW32gKY/w-d-xo.html Very interesting optimizations in Python.

    • @themuslimview
      @themuslimview 2 หลายเดือนก่อน

      Hahaha. Wonder how fast python can get for this.

    • @ratsock
      @ratsock หลายเดือนก่อน

      Few ppl got python to around 9-10s

  • @ryanquinn1257
    @ryanquinn1257 9 หลายเดือนก่อน +54

    Your comment on the Apple silica is literally why I bought my first max after years of fighting.
    It’s performant efficiency. My ability to be off charger for wayyyy longer under load. I got to test drive a x1 Lenovo and it last 1.75 hours under decently heavy load while my M1 Pro was about 7hrs.
    It’s something I’ve been annoyed at intel for a while at just pushing inefficient power to get more and more diminishing clock gains. Performant battery life is one of the best innovations of the last couple decade imo along with SSD storage.

    • @mitchierichie
      @mitchierichie 9 หลายเดือนก่อน +6

      The battery life on the M1 is insane. The bootup time is practically instant, too.

    • @pieterrossouw8596
      @pieterrossouw8596 9 หลายเดือนก่อน +2

      I've been a Windows/Linux guy, had plenty of fast ASUS gaming laptops and Dell Precision Workstations. Switched to a M2 Pro 32GB and not sure what could convince me to switch back at this point.

    • @theairaccumulator7144
      @theairaccumulator7144 4 หลายเดือนก่อน

      That's because Macbook CPUs are upscaled iPhone CPUs. Not a real processor.

    • @sc.frederick
      @sc.frederick 3 หลายเดือนก่อน

      @@pieterrossouw8596 I was a linux guy too for years, windows before that. Took a risk and bought an M1 pro 32gb the day they launched, the full performance with ample time away from the wall was too hard to pass up. Want to switch back to linux, but the battery life, quiet operation, trackpad, etc are just too hard to leave. Other manufacturers need to step up their game...

    • @maimee1
      @maimee1 2 หลายเดือนก่อน

      ​@@pieterrossouw8596 Price, a desk job, CPU-intensive/GPU-intensive workload? The moment you want to start moving without a charger though, it's probably game over for Intel laptops. Maybe AMD has a chance? Never tried that.

  • @jperkinsdev
    @jperkinsdev 9 หลายเดือนก่อน +60

    Hey Prime, this was one of the best programming videos I've ever seen. I really want to see you tackle some of these problems, and venture into that "unsafe" zone of Go that I'm so scared of.

    • @dejangegic
      @dejangegic 4 หลายเดือนก่อน

      what's the "unsafe" part of go?

    • @jerofff
      @jerofff 4 หลายเดือนก่อน

      @@dejangegicditching go and using js

  • @ErikBackman242
    @ErikBackman242 8 หลายเดือนก่อน +18

    Kudos to the Java community - all jokes about the language and the jvm aside - for launching a challenge that inspired extensive collaboration and interesting articles even outside their own "domain". Credit where credit is due. Applause.

  • @nightshade427
    @nightshade427 9 หลายเดือนก่อน +24

    My first stab would be to memory map the file. Split the data into number of hardware threads available, have each thread process it's chunk and create its own hash of sum, min, max, count, etc. Then join the threads and tabulate the final results. This way each thread is independent and there is no locking on anything. Not sure how it would perform in go or any other language, but that's the basic approach I would start with.

    • @daasdingo
      @daasdingo 9 หลายเดือนก่อน +4

      Exactly! I was so frustrated that they have such complex structures, when there is a much simpler approach that is very likely much faster!

    • @daasdingo
      @daasdingo 9 หลายเดือนก่อน +4

      The only obvious issue I can see with this is the overlap of IO with compute, so might be necessary to tune up thread count significantly higher than hardware thread count

    • @nightshade427
      @nightshade427 9 หลายเดือนก่อน +5

      Looks like someone implemented a c version of what I mentioned above and it runs in about 1-1.5s

    • @egorandreevich7830
      @egorandreevich7830 9 หลายเดือนก่อน

      One of the main principles of golang is "Share Memory By Communicating", which is why this route was chosen, I think.

    • @kippers12isOG
      @kippers12isOG 7 หลายเดือนก่อน +1

      This is what the fastest Java impls all did. 🏆

  • @KangoV
    @KangoV 9 หลายเดือนก่อน +34

    This was awesome to see the process that the developer went through. Good job. But, it amazes me that Java's JVM is so powerful that it can do this in 1.8 seconds without a native binary and 1.5 seconds with GraalVM.

    • @doublekamui
      @doublekamui 9 หลายเดือนก่อน +3

      the jvm monitors frequently executed code and compiles them into native binary code on the fly. this means the next time the code is called, it can run much faster than before because it's binary code and tailored to the specific processor

    • @javierflores09
      @javierflores09 9 หลายเดือนก่อน +8

      @@doublekamui JIT doesn't really form part of the equation in this scenario considering this challenge is not a long-lived service but a short-lived program, hence why there are benefits in performance from using Graal native binaries. The reason this is very fast is because they're making use of recent additions in the Java API to do manual memory handling in order to load the file faster, aside from being smart with the data structure of course.

    • @neonmidnight6264
      @neonmidnight6264 9 หลายเดือนก่อน +2

      C# does it in 1.2s (because Java does not have SIMD (Panama vectors have worse codegen and are higher level, have crossplat issues currently) primitives that allow to express the necessary code).

  • @StrengthOfADragon13
    @StrengthOfADragon13 8 หลายเดือนก่อน +2

    I love both this challenge and this article. It's an excellent example of the way I would expect optimization to work in industry, albeit in a more approachable format. Step 1 is make it work, then start to analyze and optimize approachable bottle necks iteratively until you hit your goal. What the article skips over briefly is the extra reading and deep diving you would need to do to go into what things seem like ACTUAL potential time saves if you aren't familiar w/ optimizations like this from the outset.
    Would LOVE to see something like this done for a website or other UI. (Make it work, then making the response time under 1/4 of a second, or whatever other goal you would have. That case is more of finding unexpected bottle necks and cutting them out)

  • @aseeralfaisalsaad
    @aseeralfaisalsaad 9 หลายเดือนก่อน +4

    I recently adapted Go and absolutely loving it as my personal primary backend language and got away from JS/TS for everything. I still use NodeJS those for current job but Go gives me a minimal syntax vibe of Python, modern C++ like characteristics with speed and concurrency.

  • @Nenad_bZmaj
    @Nenad_bZmaj 9 หลายเดือนก่อน +3

    Perhaps, instead of using a map, we could construct an array . I don't know how fast the hash function for maps is in Go, but perhaps it's worth trying the following. Convert three bytes from each name into 15 bits. Three-byte combination is unique for 10.000 names. We can, for example, take first three bytes ( If the characters in the string (name) are not all from ASCII, we should take the 4th, 8th and 12th byte, with len check first) Then, by using fast bit-operations, from each byte extract bits at 5 characteristic positions to three contiguous bit-chunks. This gives an unsigned number < 32.768 which is the index of an array element, a sub array consisting of min, max and avg. Sort using radix sort with the radix of 5 bits (the order should remain alphabetical, if I'm not wrong). The array will be two-thirds empty, though, but this is a rather small array, so it doesn't matter. Actually, not sure about preserving the alphabetical order if characters are not all from ASCII, but in that case we can still use the same procedure, bu store also the name as a string in the array element, and then sort by runes also using radix sort.. The first and last out of three stages of the radix sort can be done in parallel.

  • @NuncNuncNuncNunc
    @NuncNuncNuncNunc 9 หลายเดือนก่อน +87

    You need to use monoids -- Haskell Fanboy
    I'd think IO would be the first thing to tackle -- find fastest way to read huge file without doing anything and then find the fastest way process each chunk, then fastest way to reduce the chunks.

    • @locusteamfortress3868
      @locusteamfortress3868 9 หลายเดือนก่อน +3

      Good old map reduce problem

    • @rj7250a
      @rj7250a 9 หลายเดือนก่อน

      First thing is to copy each row 1 million times, because side effects are LE EVIL!

    • @NuncNuncNuncNunc
      @NuncNuncNuncNunc 9 หลายเดือนก่อน

      @@rj7250a Uh, yeah. If your scheme involves mutating the original data, you've already failed.

    • @rj7250a
      @rj7250a 9 หลายเดือนก่อน

      @@NuncNuncNuncNunc well you need do 1 copy from disk to memory anyway, because computers can only copy, not move data. So in any way, you are keeping the original file intact.
      But side effects are essential for most fast algorithms. That is why sorting alorithms modify the array in place, imagine copying a array with 100 MB every time that a swap happens, 90% of the sorting time would be garbage collector work. If you want to keep the original array, you just copy it before the sorting and sort that copy, 1 copy vs millions of copies.

    • @NuncNuncNuncNunc
      @NuncNuncNuncNunc 9 หลายเดือนก่อน +1

      @@rj7250a Moving and mutating are not equivalent. It's splitting hairs to say a copy from disk is not moving data, but still beside the point that read is non-mutating. Taking up the Haskell Fanboy mantle, all copies are equivalent so n copies are accomplished in constant time and memory.

  • @josegabrielgruber
    @josegabrielgruber 9 หลายเดือนก่อน +7

    Great article, loved the scientific approach of handling the challenge!

  • @hirenpatel6118
    @hirenpatel6118 9 หลายเดือนก่อน +3

    I'd probably do a boss worker model on a chunk of the lines, for each thread return a map of the thread local minimum/maximum/avg/count. On return you can then combine. There's no need for input state to be sync'd.
    Also as I'm watching more of this, IDK if an array vs struct would be that much different. Struct should be packed similarly to the array in this use case. You could probably even get away with using a tuple too. An easy way to validate this would be to look at the IR for go for this program.

  • @spicynoodle7419
    @spicynoodle7419 9 หลายเดือนก่อน +57

    I would read 1k lines per thread and base the hash on the bucket index (bucket-1000-0, bucket-1000-1..., bucket-2000-0, bucket-2000-1...)

    • @watchbro3319
      @watchbro3319 9 หลายเดือนก่อน +9

      nah man u kiddin its 2024 i used PrimaGENAI 6x9B STFP model it finsihed a billion rows like in 0.00000000069s kk😎

    • @JacobyB
      @JacobyB 9 หลายเดือนก่อน

      what @@watchbro3319

    • @CurtisCanby
      @CurtisCanby 9 หลายเดือนก่อน +3

      Do it. See if you get better or worse results… FOR SCIENCE

    • @spicynoodle7419
      @spicynoodle7419 9 หลายเดือนก่อน +1

      @@CurtisCanby Imma do it in Zig cuz i'm learning it now

  • @TurtleKwitty
    @TurtleKwitty 9 หลายเดือนก่อน +3

    Array vs struct in that usage is the same, but you'd probably want to use a struct so you can try padding it out to cache line size and align the entire thing on cache line

  • @vallariag
    @vallariag 9 หลายเดือนก่อน +6

    Really great article! Followed her on Twitter.

  • @machinima1402
    @machinima1402 9 หลายเดือนก่อน +65

    would love to see you solving the challenge iteratively

    • @ashersamuel958
      @ashersamuel958 9 หลายเดือนก่อน

      yeah

    • @nyahhbinghi
      @nyahhbinghi 9 หลายเดือนก่อน +1

      how would an iterative solution be different?

    • @shadowpenguin3482
      @shadowpenguin3482 9 หลายเดือนก่อน +3

      I think OP meant iterative as in iterative software development, starting off with a first working solution and gradually improving it

    • @ashersamuel958
      @ashersamuel958 9 หลายเดือนก่อน

      @@nyahhbinghi more pain more good

  • @jrgf6778
    @jrgf6778 9 หลายเดือนก่อน +39

    The record in java is 1.4 seconds, prescinding of the garbage collector, using multithreading with 8 cores, and using the Vector API

    • @Gusto20000
      @Gusto20000 9 หลายเดือนก่อน +9

      It’s 02.957 on 32-core server. With 10k stations which is only fair way to evaluate because it says in the rules - 10k stations. 1.535 result achieved on quote “The 1BRC challenge data set contains 413 distinct weather stations, whereas the rules allow for 10,000 different station names”

    • @jrgf6778
      @jrgf6778 9 หลายเดือนก่อน

      Check again my friend github.com/gunnarmorling/1brc

    • @jrgf6778
      @jrgf6778 9 หลายเดือนก่อน +1

      ​@@Gusto20000 Check again the gunnar morling repo the 1st place is 1.536 sec uses a all of the above, the 32 cores that you said a treemap to get the info organized and bit operations to calculate the values per station

    • @jrgf6778
      @jrgf6778 9 หลายเดือนก่อน +4

      Also there is a discussion if a linux kernel can be optimized to increase perfomance, because it hits the perfomance of compiling and executing the bytecode

    • @nafakirabratmu
      @nafakirabratmu 9 หลายเดือนก่อน +2

      The data is also preloaded on a ramdisk, so it's pretty much IO unbound

  • @u9vata
    @u9vata 9 หลายเดือนก่อน +6

    Relevant info for those who say "Java can do it in 1.5 seconds" beware that the java tests ran on RAMDISK of this machine:
    "dedicated server (32 core AMD EPYC™ 7502P (Zen2), 128 GB RAM).
    Programs are run from a RAM disk (i.o. the IO overhead for loading the file from disk is not relevant), using 8 cores of the machine."
    It is not a bad article, but as a hiperf-nerd I find it also fascinating how the guys thinking process is: Like calculating sum/min/max instead of storing the file in memory first is so annoying he does not do at first try even without thinking (I saw this surprised prime too). Also mmap-ing the whole thing should work better (even with ramdisk I feel) plus doing threads the way the guy does here seems very wasteful. I literally think that a simd-style parallelism wins over threads here very bigly.
    But if one really wants to use threads I imagine it should be more like somehow doing big chunks of the file I/O separately - like starting from line X (by literally doing A, B, C, D threads or even 8 threads that read only parts of the file - but HUGE parts). This means to have 4-8 min-max-count-sum values for each thread for each processed block (that can still happen with simd) and then after all of them finished (for which you not need any inter-thread communication and can do it fully lock-free) you just mix together that result.
    I think on that heavy EPYC server on a ramdisk, with well set up I/O in C should be done in milliseconds or 10-100 of milliseconds at most to be honest. On "real" machines it should be bottlenecked by I/O performance of your hardware.

  • @constantinegeist1854
    @constantinegeist1854 8 หลายเดือนก่อน

    Channels IIRC have mutexes internally. What I'd do: take a hash of the name, take a modulo, use it to locate the target goroutine in a finite list. Each goroutine is associated with a preallocated slice + start/end indexes (implement the queue manually, a lock-free circular buffer). That would avoid synchrosization costs. Enqueue while you read. Reading can be parallelized too (if it's not an HDD). Calculate on the fly without storing in intermediary data structures.

  • @owlmostdead9492
    @owlmostdead9492 9 หลายเดือนก่อน +15

    4:10 the apple encoders are very efficient BUT they're much much worse in quality at low bitrates (streaming bitrates). Currently the best quality h.264, h.265 and Av1 goes to Intel, then Nvidia, AMD and dead last Apple. To give you a reference Apples h.265 is about as good as Intel's h.264 at the same bitrate, so ~2x worse than Intel.

    • @ark_knight
      @ark_knight 9 หลายเดือนก่อน

      And AMD has the smallest file size.

    • @owlmostdead9492
      @owlmostdead9492 9 หลายเดือนก่อน +1

      @@ark_knight It's also worse in quality, size/quality efficiency is pretty much still CPU software > .. > Intel > Nvidia > AMD > .. > Apple

    • @ark_knight
      @ark_knight 9 หลายเดือนก่อน

      @@owlmostdead9492H.265 encoders are pretty close to Nvidia and sizes are smaller.
      And then they have xilinx (I think xilinx?) fpga powered encoders which are superior than even Intels - though its not for consumers yet.

    • @bits360wastaken
      @bits360wastaken 8 หลายเดือนก่อน

      Source?

    • @owlmostdead9492
      @owlmostdead9492 8 หลายเดือนก่อน

      @@bits360wastaken numerous AV forums and I tested it myself.
      I used Netflix's quality evaluation algorithm VMAF on the test footage.
      I tested encoding of a 2020 Intel MacBook, 2021 Apple silicon M1 Max MacBook, 12th Gen Intel iGPU (Thinkpad X13) and nvenc on a 2080Ti.
      The VMAF score was largely also representative of the visual difference in quality I could perceive with my own eyes, so I'm not just quoting numbers here.

  • @dand4485
    @dand4485 9 หลายเดือนก่อน +2

    I wonder one more optimization... Have a lookup table/hashmap for all temps? If you have a hash map of [strings][float] for example allTemps["-99.9"][-99.9], no parsing and a straight hash lookup... Should be much faster then parsing anything...? Assuming the data is value for the temp values should be perfect to get some perf :)

  • @IshaanBahal
    @IshaanBahal 9 หลายเดือนก่อน +16

    9:07 I'd personally have all routines process chunky reads from file, and return maps to merge using the coordinator routine (main), so that way i won't have to do any mutexes. Might cause a bottleneck on the merge part, but the row processing would be fast enough in threads but might be similar to mutex waits. Actually, i kinda wanna write both and find out now.

    • @ark_knight
      @ark_knight 9 หลายเดือนก่อน +2

      DO IT!

    • @IshaanBahal
      @IshaanBahal 9 หลายเดือนก่อน +2

      @@ark_knight alright. Shall update here.

    • @anonymousanon4822
      @anonymousanon4822 9 หลายเดือนก่อน

      ​@@IshaanBahal Comment so I get a notification

  • @Exilum
    @Exilum 9 หลายเดือนก่อน +1

    People probably pointed it out, but the rounding is indeed ceiling as you figured out. Negative numbers, however, need to be rounded in the same direction, so there's some level of complexity to that. You can't /10*10 the integer like some people thought unless goland has a different opinion than most other programming languages. I couldn't find the specifics for goland, but integer division should point to 0, so negative numbers wouldn't be right.

  • @ScarabaeusSacer435
    @ScarabaeusSacer435 9 หลายเดือนก่อน +2

    This is another case where go isn't really any faster than java-- although it does have better memory consumption-- and you give up so much power from being able to use other JVM languages like Scala and Clojure, not to mention all the third-party library support.

  • @chriss3404
    @chriss3404 9 หลายเดือนก่อน +3

    Gonna be honest, it hurt to immediately realize that map access was going to be a problem late game, and then learn at the end that they weren't going to do anything with it.
    This BRC thing is a really fun challenge!

  • @ycombinator765
    @ycombinator765 9 หลายเดือนก่อน +3

    7:43: "I always reduce my loads when I can ... damn!"
    this is what i am here for, every single fkin day!

  • @sjoervanderploeg4340
    @sjoervanderploeg4340 3 หลายเดือนก่อน

    It depends on your storage medium, 512k is the "legacy" standard and 4096k is the "new" standard sector size! You want to read an entire block of data at once so multiples of sector size!

  • @_fakedub
    @_fakedub 9 หลายเดือนก่อน +3

    always bamboozeled by all the technical shit that goes on in this channel as im super new to coding but this one i swear i understood. great article

  • @JArielALamus
    @JArielALamus 9 หลายเดือนก่อน +5

    17:46 wait a second, the rules states the rounding should be done in the output, not the input. Converting to int64 will cause a loss in precision and increase a lot the rounding error of the final mean value.
    Since all the values have only one fractional digit, why did the author not store this additional decimal value to do some fixed point arithmetic to reduce the rounding error?

    • @SmartassEyebrows
      @SmartassEyebrows 9 หลายเดือนก่อน +7

      Even easier than that. Just take the numbers as whole numbers by ignoring the decimal (so 99.8 -> 998), do the sum, and then divide by 10 at the end to return to a Float with a single decimal, divide by count to get average, then round. Very straight forward with only two divisions and one round per station. Might be able to avoid that divide by 10 even with more clever manipulations to get down to just the divide by count (average) and round.

    • @SimGunther
      @SimGunther 9 หลายเดือนก่อน +1

      ​​​@@SmartassEyebrows how can you tell the difference between 9.98 and 99.8 with your scheme?
      Checkmate!
      Edit: I totally forgot to read the rule being "1 exact fractional digit" for each value at 10:28 🤦‍♀️

    • @JohanIdstam
      @JohanIdstam 8 หลายเดือนก่อน +2

      @@SimGuntherThe rules says there is always one decimal digit in the dataset.

  • @earthling_parth
    @earthling_parth 9 หลายเดือนก่อน

    I would love if this became a mini-series on yt from your streams on your main channel or on one of your secondary channels. This would be immensely helpful to noobs like me on Go and just programming in general ❤

  • @tr7zw
    @tr7zw 9 หลายเดือนก่อน +12

    Makes it even more insane that the fastest Java solution is 00:01.535 seconds with GraalVM(Compiling to native), and 00:01.880 seconds on any JVM.

    • @andrebrait
      @andrebrait 9 หลายเดือนก่อน +4

      Yes, but I've read it and it does a lot of low-level stuff you'd normally not do in Java. It literally reads memory directly with Unsafe and then uses the fact the underlying architecture is little endian to parse integers without branching.

    • @Gusto20000
      @Gusto20000 9 หลายเดือนก่อน

      Yeah, on 32core (64 threads) AMD Epic AND “data set contains 413 distinct weather stations”, not 10K stations. My Golang solution is 27sec on a laptop

    • @FilipCordas
      @FilipCordas 9 หลายเดือนก่อน +4

      Dotnet AOT at 1.197s and Jit 1.296

    • @deado7282
      @deado7282 9 หลายเดือนก่อน +1

      U would be able to that in C# as well, but at this point u abandone everything that makes these languaged. Just use C/C++ before u end up doing C/C++ within another language.

    • @nafakirabratmu
      @nafakirabratmu 9 หลายเดือนก่อน +1

      The data is also preloaded on a ramdisk

  • @Laminad18
    @Laminad18 3 หลายเดือนก่อน +1

    Imagine trying to solve this recursively. The poor call stack would be weeping

  • @rich1051414
    @rich1051414 9 หลายเดือนก่อน +2

    Goroutines are coroutines, only in go. Coroutines are a fancy way of saying threads, except they are managed.

  • @macaroni_italic
    @macaroni_italic 9 หลายเดือนก่อน +2

    Apple Silicon is significantly more energy efficient for basically every purpose. People can hate on Apple all they want, but the fact is that they are doing excellent work with their new ARM chips.

    • @LtdJorge
      @LtdJorge 9 หลายเดือนก่อน +1

      Not to detract from them, but part of their efficiency advantage is that they are always at least one litography node ahead.

  • @AtariWow
    @AtariWow 9 หลายเดือนก่อน +1

    Now I want to see people do this in all sorts of languages and see just what comes of it. I also want to try this myself in Go and if I was still well versed in Python which I've given up for Go, maybe I'd try that and suffer.

  • @devcoffee
    @devcoffee 9 หลายเดือนก่อน +4

    The Sam Bankman-Fried prison photo is adorable

  • @spottedmahn
    @spottedmahn 9 หลายเดือนก่อน

    5:00 Challenge accepted 😜! After many humans solve the 1BRC wouldn’t one be able to train a model & then it would output decent code? Possibly something the humans didn’t think of? 🤔

  • @cfhay
    @cfhay 8 หลายเดือนก่อน

    I never tried to solve this myself, but my first thought was to MMAP the file and access the bytes like it was memory? Then don't need to think hard how to optimize reading file chunks into buffer. I don't know where this would end up though performance wise.

  • @BAXEN
    @BAXEN 8 หลายเดือนก่อน

    In java, I never used scanner. I only used file reader and processed data manually. I always found it to be quicker, and as a sanity check I had to know what the data being read was and how to handle reading it

  • @goaserer
    @goaserer 9 หลายเดือนก่อน +1

    A fun exercise, even though I'd probably not go as far as to write my own string to int parser

  • @aidanbrumsickle
    @aidanbrumsickle 9 หลายเดือนก่อน

    My current computer only has 8 gigs of RAM, so i guess i'd have to use some cloud VM or something to really test this, but it does sound fun. I'd love to try this in Go and Zig. I wonder how much the differences in SIMD instructions between intel and arm would impact the results.

  • @garagedoorvideos
    @garagedoorvideos 9 หลายเดือนก่อน +2

    made great sence in the process of solving the 1BRC. Bravo

  • @juanitoMint
    @juanitoMint 8 หลายเดือนก่อน

    I used to process credit card operations dumped by banks and used some of the optimizations showed here but learned a couple more from this vid 💪

  • @AlexPaluzzi
    @AlexPaluzzi 9 หลายเดือนก่อน +11

    I am loving the Go content so much.

    • @nyahhbinghi
      @nyahhbinghi 9 หลายเดือนก่อน

      Prime wants to do Go next, but that's a mistake. Golang itself was a mistake.

    • @AlexPaluzzi
      @AlexPaluzzi 9 หลายเดือนก่อน

      @@nyahhbinghi you’re a mistake

    • @CatFace8885
      @CatFace8885 6 หลายเดือนก่อน

      @@nyahhbinghi Why so? Go seems like a beautifully simple language.

  • @TheApeMachine
    @TheApeMachine 9 หลายเดือนก่อน

    "Creating a new array might seem easier" true, but then you need to understand when and why sync.Pool is so much faster. Not at the end of the video yet, but my guess is a sync.Pool of pre-warmed goroutines is going to play a big part in a truly performant solution.

  • @ame7165
    @ame7165 8 หลายเดือนก่อน +1

    i got python down to 3.3 seconds on a 16gb m2 macbook pro. i don't think i can get below 1 second like the java bros without resorting to rewriting it for compiling. vanilla single threaded python using csv module to iterate it a row at a time took a little over ten minutes, so there's a lot of improvement to be had from that starting point 😆

    • @ClaytonTownley
      @ClaytonTownley 2 หลายเดือนก่อน +1

      Code or it didn't happen.

  • @microcolonel
    @microcolonel 8 หลายเดือนก่อน

    The real way to solve this problem is to fix the stupid file format. Register each station to an integer identifier, and each temperature to an integer value, and you can mmap the whole file and move slices to any number of threads with no communication. You could even put the whole record in 4 bytes easily, making it trivial to extract the temperature and accumulate the stats with SIMD or vector machines.
    There is a way to do the mmap split without fixing the format, but it's a bit messy.

  • @v2ike6udik
    @v2ike6udik 9 หลายเดือนก่อน +4

    Seal of approval

    • @f.f.s.d.o.a.7294
      @f.f.s.d.o.a.7294 2 หลายเดือนก่อน

      Arf arf arf

    • @v2ike6udik
      @v2ike6udik 2 หลายเดือนก่อน +1

      @@f.f.s.d.o.a.7294
      slaps flippers against the swollen tummy fast that is full on yak´s blood, wooly bully yak`s blood

  • @Amy-601
    @Amy-601 9 หลายเดือนก่อน

    Exactly my first thought ( for first take/ baseline) min and max! Storing map/ array takes space- one 1️⃣ bil ?? Hmm 🤔! Nope! Also split the rows and read. Aggregate all max/ min from split data. And then we improv… but for first take baseline brute force , that’s what I’d do! 7:47 Number 3, classic producer/ consumer with the wait() and notify() !! Who’d have “ thunk” it? Lol 😂 12:57 nice 👍 Also wondering after processing, if he did a flush() of the buf channel / stream. Yeah I’d increase it to 1000 from 100..… 21:15 Now he’s talking my language- chunk read ( above) and min/ max.. lol 😂 yep 👍 agree 👍- simple incremental optimization- nothing broken. That’s how you’d want to introduce change in production also. Or you’d have 1 billion customers clamoring at your door 🚪!! 24:16 Except I’d have shot for min/ max in step1. But that’s just me! Lol 😂 GO is a lot like Java with the ReadBytes lol 😂.. Good 😊 article/ video ! Yay 😀! - Amy ( aside: TreeMap in Java is sorted by keys 🔑 - If the temp was the key per city… hmm 🤔 29:21 )

  • @TrebleWing
    @TrebleWing 4 หลายเดือนก่อน

    on what planet does rounding towards positive yield -1.5==1?
    the closest integer value towards +infinity is -1

  • @microcolonel
    @microcolonel 8 หลายเดือนก่อน

    Wonder why they specify float rounding for something that fits losslessly in 11 bits of signed integer.

  • @gregoryolenovich6440
    @gregoryolenovich6440 2 หลายเดือนก่อน

    Thank you twitch chat. I was actually triggered when he made a seal sound instead of saying kissed by a rose.

  • @7th_CAV_Trooper
    @7th_CAV_Trooper 9 หลายเดือนก่อน

    Prime just invented the DynamoDB sharding model. It's not a surprise that switching to int was faster. I just am not sure how you get there quickly. Read left of dot, multiply by 10, add right of dot? 1 billion times? I don't know.

  • @Kane0123
    @Kane0123 9 หลายเดือนก่อน +1

    I’d have it output random values and just claim it was a hallucination if wrong.

  • @G.Aaron.Fisher
    @G.Aaron.Fisher 9 หลายเดือนก่อน

    I swear that no matter how long I'm in this profession I'm always going to read Go as the board game rather than the language. Somehow I thought this might be about a computer Go tournament on a 1 billion by 19 board.

  • @__gadonk__
    @__gadonk__ 7 หลายเดือนก่อน

    2:31 Hamburg Mentioned! RAAAAAH 🔥🔥🔥🔥🔥🔥 Es gibt kein schlechtes Wetter nur falsche Kleidung!!!

  • @MartinCharles
    @MartinCharles 4 หลายเดือนก่อน

    This approach is scientific, but the writer made the biggest gaff off all (and prime too), optimizing something where there is no bottleneck. I'm guessing the parsing is the bottleneck not the aggregate computation so putting a queue or any sort of fanout in front of the aggregation probably just introduces overhead. I really doubt the overhead of the go routine is less than adding a few numbers together.

  • @88Nieznany88
    @88Nieznany88 9 หลายเดือนก่อน

    I think trie was only for identifiers and inside we would've kept the same 4 values as in map.

  • @mad_t
    @mad_t 9 หลายเดือนก่อน +1

    Java made it in 1.53sec using a 32core EPYC monster. M1 pro 10C/16G is nowhere near that beast :)
    So we can't even compare the results

  • @chezzy6366
    @chezzy6366 25 วันที่ผ่านมา

    15:39 How do you make those performance graphs?

  • @JensRoland
    @JensRoland 9 หลายเดือนก่อน

    “Kiss From A Billion Rows” - Ceil

  • @PristinePerceptions
    @PristinePerceptions 9 หลายเดือนก่อน

    5:26 You're not a General AI. You're Captain AI. 😎

  • @wolfgangsanyer3544
    @wolfgangsanyer3544 9 หลายเดือนก่อน +1

    @theprimeagen if you go full time content creation, your first project HAS to be automating turning off alerts while streaming.

  • @wannabelikegzus
    @wannabelikegzus 8 หลายเดือนก่อน

    Haha, so this is why this project was trending when I was looking around at golang projects.

  • @tobiasnickel3750
    @tobiasnickel3750 9 หลายเดือนก่อน

    one question: isn't reading all the data from disk or even SSD taking already much more time then these little computations for each line? I mean everything has to go through the sata cable.

    • @nikolaoslibero
      @nikolaoslibero 9 หลายเดือนก่อน +1

      Could just put it all into RAM first.

    • @tobiasnickel3750
      @tobiasnickel3750 9 หลายเดือนก่อน

      @@nikolaoslibero all right, if that is by the rules,...

  • @richcole157
    @richcole157 9 หลายเดือนก่อน

    All these conditions, no attempt to account for file cache on an iOS bound problem.

  • @nexovec
    @nexovec 8 หลายเดือนก่อน

    Is this some insanely fast lockless tree implementation competition disguised as a file parser kind of exercise?

  • @test-rj2vl
    @test-rj2vl 9 หลายเดือนก่อน

    Would have wanted to see you code this challenge in this video.

  • @ArturdeSousaRocha
    @ArturdeSousaRocha 5 หลายเดือนก่อน

    24:13 Yeah, I love that this optimization is not rocket science.

  • @alkamino
    @alkamino 6 หลายเดือนก่อน

    WELCOME TO COSTO I LOVE YOU

  • @eric-seastrand
    @eric-seastrand 9 หลายเดือนก่อน +1

    I want to see somebody do this in browser JavaScript. Spawn background processes with service workers and such.

    • @ycombinator765
      @ycombinator765 9 หลายเดือนก่อน +2

      you naughty naughty

  • @spottedmahn
    @spottedmahn 9 หลายเดือนก่อน

    “Seal” 😂 2:10

  • @Zekian
    @Zekian 9 หลายเดือนก่อน

    Should probably memory map the file and delegate file reading to the OS.

  • @abhishek_k7
    @abhishek_k7 9 หลายเดือนก่อน

    noice way to learn optimization

  • @digibrett
    @digibrett 9 หลายเดือนก่อน +1

    I also thought of Batman Forever.

  • @Soromeister
    @Soromeister 9 หลายเดือนก่อน +1

    Just use AWK for this.

  • @ersetzbar.
    @ersetzbar. 9 หลายเดือนก่อน

    I thought I was the only one doing the seal thing

  • @johnathanrhoades7751
    @johnathanrhoades7751 4 หลายเดือนก่อน

    For those who come later, a trie is a different data structure than a tree.

  • @uuu12343
    @uuu12343 9 หลายเดือนก่อน

    Oh hey, its the picture of SBF in prison

  • @kouoshi
    @kouoshi 9 หลายเดือนก่อน +1

    00:02 Kek "First photo of Sam Bankman-Fried in jail"

  • @ThameemAbbas
    @ThameemAbbas 8 หลายเดือนก่อน

    Unless I am trying to show a poorer baseline, my thought process was to aggregate them per key.

  • @heartthymes
    @heartthymes 9 หลายเดือนก่อน

    I guess he could also pass pointers to channel instead of objects.

  • @architbhonsle7356
    @architbhonsle7356 9 หลายเดือนก่อน +4

    You can do a map reduce on this right?

    • @HjalmarEkengren
      @HjalmarEkengren 9 หลายเดือนก่อน

      Yes all fast implementations do a variation of that. One of the hard(ish) things to do is to quickly divide the file into roughly same sized chunks so that you can do the map operation in parallel.

    • @architbhonsle7356
      @architbhonsle7356 9 หลายเดือนก่อน

      @@HjalmarEkengren hmm. Could you open the file, seek ahead an estimated number of bytes (total bytes / n) then go forward till a newline and then jump again. Seems like a fun challenge.

    • @HjalmarEkengren
      @HjalmarEkengren 9 หลายเดือนก่อน

      ​@@architbhonsle7356 That's the trick i think most use. They fast forward total_bytes/no_threads, scan til the first newline and store a pointer, then repeat. The challenge truly is fun. none of the tricks used are hard and feel obvious in hindsight but putting it all together gets some pretty interesting results. As well as demonstrating how far from a computers potential the code we usually write is. And something being IO bound is a myth/BS :D

  • @complexity5545
    @complexity5545 9 หลายเดือนก่อน

    Start bringing back the obfuscation challenges...lol...

  • @stefdevs
    @stefdevs 9 หลายเดือนก่อน +1

    I think she resorted to threading way too early.

  • @Nocare89
    @Nocare89 9 หลายเดือนก่อน +2

    I had to pre-calculate distances between top 1000-ish cities on earth. I calculated this as 2-4 billion rows to be saved in sql.
    At first it looked like it was gonna take weeks running on a server. Then I put an sql table into memory instead of regular. Then it ran in less than like 2 hours with php.

    • @nyahhbinghi
      @nyahhbinghi 9 หลายเดือนก่อน

      1000 + 999 + 998 ... + 2 + 1 = 500,500.

    • @Nocare89
      @Nocare89 9 หลายเดือนก่อน +1

      @@nyahhbinghi I likely had both a-to-b and b-to-a rows. I know I had at least 2 billion rows. This was over 10 years ago. I don't have all the details squared away anymore lol. We also used some math shortcut to avoid comparing locations too nearby to each other. So maybe I remember the count of cities way wrong. Mostly I just remember manually fixing a ton of hebrew/arabic names that the db didn't play nice with.

    • @nyahhbinghi
      @nyahhbinghi 9 หลายเดือนก่อน

      @@Nocare89 nice, it's all good

  • @InfiniteQuest86
    @InfiniteQuest86 8 หลายเดือนก่อน

    Yeah I was pretty confused why he wasn't just mmaping the file in. Seems so slow to do buffered reading of any sort.

  • @FabulousFadz
    @FabulousFadz 9 หลายเดือนก่อน +2

    I see Golang, I click. Go is clickbait for me. Incidentally, I did this same challenge in Go. I'm running on an 8th Gen Intel i5 (Quadcore with 8 hardware threads) and 16GB RAM. I memory mapped the file, had a struct with Sum, Count, Min, and Max and on that I ended up getting around 13-15 seconds depending on which the way the wind was blowing. After using some go routines to work on each entry and report their results in a channel, I merged the maps together for the final result which I added to a minHeap in order to get the cities in alphabetic order. I came across this after the deadline and I held off from sharing my experience and findings until I can run it on hardware similar to what the test hardware was going to be. Other contributions were completing in under 5 seconds on that so I want to see what that will do to my times.

  • @MasamuneX
    @MasamuneX 9 หลายเดือนก่อน

    i ran into a problem just like this reading a financial dataset that was 9 gigs of data tldr i just chunked it

  • @styleisaweapon
    @styleisaweapon 9 หลายเดือนก่อน

    if your runtime isnt completely dominated by loading/parsing, you've messed up.

  • @alankewem
    @alankewem 9 หลายเดือนก่อน

    Search about "Rinha de backend" a more interesting challenge

  • @JoshPeterson
    @JoshPeterson 9 หลายเดือนก่อน

    Lovitz is spot on

  • @KillianTwew
    @KillianTwew 3 หลายเดือนก่อน +1

    14 seconds to release a billion swimmwrs is not great.

  • @Hamsters_Rage
    @Hamsters_Rage 8 หลายเดือนก่อน +2

    TLDR - he is doing nothing, just reading somebody's else readme file

  • @draakisback
    @draakisback 9 หลายเดือนก่อน +1

    This challenge is kind of dumb because it's very IO bound. Once you get to a certain point, you're really just going to be optimizing the IO and beyond that it's just very basic computation.

  • @zuowang5185
    @zuowang5185 9 หลายเดือนก่อน

    It's M3 now for work laptops

  • @huvineshrajendran6809
    @huvineshrajendran6809 9 หลายเดือนก่อน

    Yes, Mistral is a thing.

  • @seasong7655
    @seasong7655 9 หลายเดือนก่อน

    I guess in python you'd use something like pandas to read the file, but not sure

  • @lancemarchetti8673
    @lancemarchetti8673 9 หลายเดือนก่อน

    Ger rid of the byte that separates the rows and save a lot of time!

  • @yaksher
    @yaksher 9 หลายเดือนก่อน

    I know this is kinda cheating because I used libraries, but my first attempt at a fast Rust solution (after implementing a baseline single-threaded solution) using rayon's parallel iterators and just mmaping the entire file to treat it as one string slice to avoid any copying and then using a hash map per thread to aggregate results per thread which are then merged resulted in a time of like 5 seconds on my laptop lmao