Zip works Better than AI


Comments • 159

  • @meatcow417
    @meatcow417 10 months ago +278

    This is hilarious because my teachers back in college used this exact formula with Gzip to calculate plagiarism between programming assignments.
    Guess some people knew that Gzip could do this 10 years ago, and we just forgot it worked.

    • @rigen97
      @rigen97 10 months ago +36

      I hope your teachers read this paper and collectively facepalm, because they just didn't realize it was something they could've published.

    • @ludoviclagouardette7020
      @ludoviclagouardette7020 10 months ago +21

      @@rigen97 As a teacher, I am internally facepalming. I am terribly and deeply saddened by not having published that first.

    • @AlgoFodder
      @AlgoFodder 10 months ago

      The underlying idea goes back at least a couple of decades. You can google the paper: "Language Trees and Zipping" by Dario Benedetto, Emanuele Caglioti, Vittorio Loreto, which appeared in Phys. Rev. Letters in 2001

    • @AvanaVana
      @AvanaVana 9 months ago +11

      The point of the paper is not so much that compression algorithms _can_ be used to calculate the degree of identicalness of data, but rather that they can use that ability at scale to classify large sets of data; moreover, in this task _they outperform complicated neural networks_ with millions of parameters, purpose-built to do this. Part of the reason it never occurred to your teachers to publish on this 10 years ago is that 10 years ago, “AI” wasn’t yet the buzzword/hot topic that it is today, garnering so much hype, focus, and $$$. In the current climate, any simple idea can become a novel research paper or an alluring pitch in Silicon Valley if it is recontextualized and repackaged (given that it is actually useful) as part of the burgeoning “AI” dialogue/industry.

  • @Geosquare8128
    @Geosquare8128 10 months ago +118

    this is my favorite genre of CS research right now. The fact that compression is a useful clustering metric is so funny, but it also makes so much sense.
    stuff like this makes me hopeful that one day we'll escape the "scale everything" model of ML research and lean into the information/learning theory side of AI research :))

    • @billowen3285
      @billowen3285 10 months ago +4

      Didn't expect to see you here but I guess it checks out

    • @Geosquare8128
      @Geosquare8128 10 months ago +9

      lol yeah this is what I do most now

    • @shivashankar28
      @shivashankar28 10 months ago +3

      hey, I am currently doing some reading on compression algorithms for embedded systems like STM32 and nRF52 microcontrollers. Do you have any suggestions please? My current contenders are miniz and Heatshrink, but I want to have an expert weigh in or at least point me in some direction :))

    • @andrewferguson6901
      @andrewferguson6901 9 months ago +1

      some would even go as far as saying that compression and intelligence, at the extreme, are nearly indistinguishable

    • @BlackDoggo-dy6lp
      @BlackDoggo-dy6lp 9 months ago

      are you tired of people saying "oh look its the dream guy"

  • @viacheslavprokopev8192
    @viacheslavprokopev8192 10 months ago +50

    Yes, it works because if X and Y are similar (have a bunch of words in common), their normalised compression distance will be small, and if two texts are different, it will be large. Think of it as "more common words = smaller compressed representation".

    • @OCPyrit
      @OCPyrit 10 months ago +3

      Yes, and I think it's not only common words but also things like similar overall structure etc. This meta-information is derived from the raw data. And the category is such a piece of meta-information as well.
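
    For the curious, this intuition is easy to check in a few lines of C with zlib (the DEFLATE implementation inside gzip). A minimal sketch with made-up sample strings; note that real gzip adds a constant header on top of these sizes. Build with: cc ncd.c -lz

    // ncd.c - sketch of Normalized Compression Distance via zlib's DEFLATE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <zlib.h>

    // C(s): compressed size of a string, the C(.) in the NCD formula.
    static uLongf csize(const char *s)
    {
        uLong n = strlen(s);
        uLongf len = compressBound(n);
        Bytef *buf = malloc(len);
        compress2(buf, &len, (const Bytef *)s, n, Z_BEST_COMPRESSION);
        free(buf);
        return len;
    }

    // NCD(x,y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    static double ncd(const char *x, const char *y)
    {
        char *xy = malloc(strlen(x) + strlen(y) + 2);
        sprintf(xy, "%s %s", x, y);
        uLongf cx = csize(x), cy = csize(y), cxy = csize(xy);
        free(xy);
        return (double)(cxy - (cx < cy ? cx : cy)) / (double)(cx < cy ? cy : cx);
    }

    int main(void)
    {
        const char *a = "the cat sat on the mat and purred";
        const char *b = "the cat sat on the hat and purred";
        const char *c = "stock markets fell sharply on monday";
        printf("NCD(a,b) = %.3f\n", ncd(a, b)); // many shared substrings => smaller
        printf("NCD(a,c) = %.3f\n", ncd(a, c)); // few shared substrings => larger
        return 0;
    }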

  • @Alexander_Sannikov
    @Alexander_Sannikov 10 months ago +96

    Computing Kolmogorov complexity is not NP-hard. No, it's literally uncomputable. Not in polynomial time, not in exponential time; it's just not computable in any finite time. Wikipedia has more details as to why, but TL;DR: you can't even brute-force iterate through all possible programs, because some of them will never terminate, and just figuring out whether a given program will terminate is already an uncomputable problem.

    • @JonasDonald
      @JonasDonald 10 months ago +16

      True, good point! Just a minor nitpick: it still is technically NP-hard, since NP-hard just means, roughly speaking, "harder than or as hard as" the "hardest" problems in NP (with regard to computational complexity).

    • @NXTangl
      @NXTangl 10 months ago +1

      Also, there's a pretty easy argument even without getting into the halting problem, which is that if you could compute Kolmogorov complexity with a finite program, there would exist some program with length n which finds the smallest number with Kolmogorov complexity greater than n and outputs it.
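
    In code form, the shape of that argument looks something like the sketch below; K is a hypothetical oracle (stubbed out so the file compiles), and the contradiction in the comments is exactly why no real K can exist:

    #include <stdio.h>
    #include <stdlib.h>

    // HYPOTHETICAL: a computable Kolmogorov-complexity function. No such
    // function can exist; it is stubbed here only so the sketch compiles.
    static unsigned long K(unsigned long x) { (void)x; abort(); }

    int main(void)
    {
        // This whole program is itself a description of some fixed length L.
        // Suppose we hard-code any bound n at least as large as L:
        unsigned long n = 1000000;
        for (unsigned long x = 0; ; x++) {
            if (K(x) > n) {
                // We would have printed the smallest number of complexity > n
                // using a program of length <= n. Contradiction, so no
                // computable K exists.
                printf("%lu\n", x);
                return 0;
            }
        }
    }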

    • @AmaroqStarwind
      @AmaroqStarwind 10 months ago

      What if you used a quantum computer?

    • @NXTangl
      @NXTangl 10 months ago +3

      @@AmaroqStarwind a quantum computer can be simulated by a classical computer with exponential overhead. So no. However, the decision version of the problem, asking whether Complexity(str) < n, is at least semi-decidable: run all programs shorter than n in parallel and accept if one of them outputs str.

    • @andrejjjj2008
      @andrejjjj2008 9 months ago +1

      Can you explain to an idiot:
      Why you can’t get the length of the program, that produces a peace of text?
      Why it’s uncomputable?
      I broke my brain reading the definition like a dozen times at this point.

  • @joshuascholar3220
    @joshuascholar3220 10 months ago +4

    One point is that one of the reasons this works is THE KIND of compression algorithm that gzip is.
    There are programs that compress text BETTER than gzip does WHICH WOULD BE WORSE for measuring similarity between texts, because they're better at taking advantage of global properties of the text (for instance, recognizing the languages) and less focused on taking advantage of context within the text.

  • @stephaneduhamel7706
    @stephaneduhamel7706 10 months ago +21

    Having a value of 2 for k is strange. It's going to cause lots of cases where two classes have the same score... Ideally, it shouldn't be a multiple of 2 or 3, to avoid these kinds of conflicts.
    Personally, I would pick a prime number bigger than the number of classes, for example 5.

    • @azai.mp4
      @azai.mp4 10 months ago +2

      And then if you get a sample like (A, A, B, B, C), where A and B are tied, you can rerun the k-nearest-neighbors but only considering neighbors in classes A or B. And you'll get a new sample like (A, A, A, B, B) for example, where you can now pick A as the "winner".
      As long as you use a prime number larger than the number of classes, this always reduces the number of classes under consideration, so if you repeat the process enough times, you'll always reach a winner.
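
    A minimal sketch of that re-voting scheme, assuming classes are single chars and the neighbor labels are already sorted by NCD (all data here is invented); with a prime k larger than the number of classes, each revote shrinks the tied set, so the recursion terminates:

    #include <stdio.h>
    #include <string.h>

    // Vote among the k nearest labels drawn from the `alive` classes; on a
    // tie, revote over the same sorted neighbor list with only the tied ones.
    static char vote(const char *neighbors, size_t n, size_t k, const char *alive)
    {
        int count[256] = {0};
        size_t taken = 0;
        for (size_t i = 0; i < n && taken < k; i++)
            if (strchr(alive, neighbors[i])) {
                count[(unsigned char)neighbors[i]]++;
                taken++;
            }

        int best = 0;
        for (int c = 0; c < 256; c++)
            if (count[c] > best) best = count[c];

        char tied[257];
        size_t t = 0;
        for (int c = 1; c < 256; c++)
            if (count[c] == best) tied[t++] = (char)c;
        tied[t] = '\0';

        if (t == 1) return tied[0];
        return vote(neighbors, n, k, tied); // fewer classes in the running each round
    }

    int main(void)
    {
        // Nearest neighbors' classes, closest first: the (A, A, B, B, C) case.
        const char *nn = "AABBCABA";
        printf("winner: %c\n", vote(nn, strlen(nn), 5, "ABC")); // prints A
        return 0;
    }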

  • @mindasb
    @mindasb 10 months ago +14

    Knowledge is compression. Your mind creates a model in your head of the world that is a compressed version of the world.

    • @anon_y_mousse
      @anon_y_mousse 10 months ago +4

      @@theboomerperspective2478 Sure it does, otherwise you'd have perfect recall.

    • @EvilTim1911
      @EvilTim1911 10 months ago +3

      And it happens to be a very lossy compression

    • @anon_y_mousse
      @anon_y_mousse 10 months ago

      @@belgianwaffles197 I'm not sure why you brought that up, or how you think any scientist knows that about chimps, but okay.

    • @anon_y_mousse
      @anon_y_mousse 10 months ago

      @@belgianwaffles197 Again, I'm asking you, but in a different way, relevance?

    • @anon_y_mousse
      @anon_y_mousse 10 months ago

      @@belgianwaffles197 Ah nice, so you take offense at everything. Got it. What you said was neither correct nor relevant. And I still don't know why so many of you have this weird brain disease. But thanks for helping me feel less inclined to be sad when humanity goes extinct in the next 100 years. I won't even miss half of you.

  • @lizardy2867
    @lizardy2867 10 months ago

    I am glad I came across your channel today. I am currently in research and development on a project which attempts to model Kolcom. I hope to see more of your journey as I go along with mine.
    One such approach I have decided to utilize is compression by reference; pointers are one implementation of this method. A word, for example, is comprised of letters, which are hex values, i.e. single-byte values. A reference frame determines the level of compression possible. This is why GZip and k-NN are very successful: by caching the reference frame, you cache the reverse program that decompresses the data.

  • @anon_y_mousse
    @anon_y_mousse 10 months ago +3

    I agree that Pascal is the only real potential substitute for C, and you mentioning that reminded me that Niklaus Wirth died on the first.

  • @OREYG
    @OREYG 10 months ago +1

    43:20 - Pascal WAS a competitor to C, and it almost won. It didn't win because:
    a) arrays start from 1
    b) lack of macros (considered an uber-feature at the time)
    c) C was available out-of-the-box in Unix
    Zig is a good candidate to fill the C niche because it DOES focus on simplicity, but it won't take off because it's too focused on its own ideals (not until there is a C-style for loop, and some of the 'errors' are categorized as warnings instead; for example, an unused variable is an error)
    44:00 it doesn't matter, because fread (and the OS) does caching for you, and realloc reserves some extra capacity

  • @marvos420
    @marvos420 10 months ago +15

    Tsoding, I really love your content dude, so genuine and interesting; you're really inspiring!
    I've wanted to ask if you're into signal processing, and if yes, it would be cool to see you do a nice tsoding session about the Fourier transform, and even wavelets: how they're similar yet very different in the sense of how they give information about both time and frequency :D You could even demonstrate how wavelets can be used to analyze features in pictures, like blood vessels in medical imaging, and so on!
    I'd really find that interesting coming from you, with the way you explain and talk about complex things in an understandable way.
    Much appreciated, and continue doing what you're doing!

    • @adriaanj
      @adriaanj 10 months ago

      He already did some videos on this for musializer iirc

  • @Mozartenhimer
    @Mozartenhimer 10 months ago +1

    Regarding the annoyance with custom printers, you can give them your compiler's printf attribute so you get nice warnings.
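
    For anyone curious, this refers to GCC/Clang's `format` function attribute; a minimal sketch with a made-up logging function:

    #include <stdio.h>
    #include <stdarg.h>

    // Tell the compiler log_info is printf-like: argument 1 is the format
    // string and the variadic arguments start at position 2.
    __attribute__((format(printf, 1, 2)))
    static void log_info(const char *fmt, ...)
    {
        va_list args;
        va_start(args, fmt);
        fprintf(stderr, "[INFO] ");
        vfprintf(stderr, fmt, args);
        va_end(args);
    }

    int main(void)
    {
        log_info("compressed %d samples\n", 42);
        // log_info("compressed %s samples\n", 42); // -Wformat now warns here
        return 0;
    }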

  • @cobbcoding
    @cobbcoding 10 months ago +9

    1:11:45 read all the data into memory and compressed it faster than github can load a single page

  • @hr3nk
    @hr3nk 10 months ago +25

    WARNING: I have to warn anyone who believes the paper's description of this approach being better than AI - it's not. The paper's code has been debunked: the authors made a critical mistake in the metric calculations when comparing their method against BERT. I don't know if this is mentioned on stream, but I felt like this is important.

    • @chibi_bb9642
      @chibi_bb9642 10 months ago +11

      source?

    • @droidBasher
      @droidBasher 10 months ago

      @@chibi_bb9642
      Ken Schutte has a summary on his blog; search for "gzip-knn-paper2 kenshutte" for a link to it.
      In summary: k=2 is weird, it causes lots of ties, and a flaw in the paper's code causes ties to resolve toward the true labels. Also, there may be train/test data contamination.
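
      To see the flaw in miniature, with invented labels: with k=2, scoring a tie as correct when either neighbor matches the truth is top-2 accuracy, not plain accuracy. A sketch:

      #include <stdio.h>

      // k=2: the labels of the two nearest neighbors, plus the true label.
      typedef struct { int a, b, truth; } Case;

      // Paper-style scoring: on a tie, count it correct if EITHER label matches.
      static int optimistic(Case c) { return c.a == c.truth || c.b == c.truth; }

      // Fair scoring: break the tie blindly (here: always take the nearest).
      static int fair(Case c) { return c.a == c.truth; }

      int main(void)
      {
          Case cases[] = { {0, 1, 1}, {2, 2, 2}, {1, 0, 0}, {3, 1, 3} };
          int n = sizeof(cases) / sizeof(cases[0]), opt = 0, fr = 0;
          for (int i = 0; i < n; i++) {
              opt += optimistic(cases[i]);
              fr  += fair(cases[i]);
          }
          printf("reported: %d/%d  fair: %d/%d\n", opt, n, fr, n); // 4/4 vs 2/4
          return 0;
      }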

    • @cat47
      @cat47 9 months ago +4

      with the updated metric, it is slightly outperformed by BERT

    • @manawa3832
      @manawa3832 9 months ago

      hello source?

    • @droidBasher
      @droidBasher 9 months ago +1

      I've provided a source a few times, but my message gets deleted. A Google search will easily find the flaws in the gzip classifier paper.

  • @wernersmidt3298
    @wernersmidt3298 10 months ago +19

    This paper is a beautiful troll. I love how you think about how something works and turn the paper into a sort of puzzle. I'm stealing this technique without shame 😂
    "Why the fuck would this even work?" is a question that really needs to be asked more by deep learning Kool-Aid kids. (Let's be honest: we use these models when we lack the capacity to explain whatever system we're... "modelling")

  • @MatejDujava
    @MatejDujava 10 months ago +3

    When k=2 and the two neighbors are different, the predicted klass will be picked based on the order of the classes (which I think is the problem). So I would say k=nklass+1 should be better

  • @顔boom
    @顔boom 10 months ago +1

    Just a stream of thoughts...
    Given the formula: NCD(x,y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    I think it's quite educational to look at the minimal test case of samples: "aabb", "aa", "bbcc"
    Assuming a naive length based on dictionary size 2, entries D and dictionary lookups L
    "aa", "aabb" => (2D3L - min(1D1L, 2D2L)) / max(1D1L, 2D2L) = 1D2L / 2D2L
    "aa", "bbcc" => (3D3L - min(1D1L, 2D2L)) / max(1D1L, 2D2L) = 2D2L / 2D2L
    "aabb", "bbcc" => (3D4L - min(2D2L, 2D2L)) / max(2D2L, 2D2L) = 1D2L / 2D2L
    So, as expected, "aa" has as much in common with "aabb" as "aabb" does with "bbcc". Of course, it's far from the complete picture, but decent enough to illustrate a point.
    Not sure how useful it would be, but there's a parallel gzip called pigz. If nothing else, the concatenation process might be useful. You could store the compressed form of every entry and then, instead of doing C(xy) it might be enough to approximate by compressing y initialized with the dictionary from x. So halving the workload for the calculation for the given samples at least.
    k = 2 is probably used due to the short samples causing too much variability for larger neighbourhoods I would think?
    The moon landing being classified as sports isn't all that strange. At least I'm pretty sure I've heard such phrases as "landing (a goal)", "landing (a hit)", etc.
    Would be interesting if this also works for image classification, or maybe even face recognition, using jpeg.
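
    The dictionary-priming idea above maps onto zlib's deflateSetDictionary(). A rough sketch, under the assumption that C(y primed with x) is an acceptable stand-in for C(xy) - C(x); the toy strings just keep the sizes small. Build with -lz:

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    // Compressed size of y when the deflate stream is primed with dict,
    // so matches against dict's bytes come out as cheap back-references.
    static unsigned long c_given(const char *y, const char *dict)
    {
        unsigned char out[1 << 16];
        z_stream s = {0};
        deflateInit(&s, Z_BEST_COMPRESSION);
        deflateSetDictionary(&s, (const Bytef *)dict, strlen(dict));
        s.next_in = (Bytef *)y;
        s.avail_in = strlen(y);
        s.next_out = out;
        s.avail_out = sizeof(out);
        deflate(&s, Z_FINISH);
        unsigned long n = s.total_out;
        deflateEnd(&s);
        return n;
    }

    int main(void)
    {
        // "bbcc" shares material with "aabb" but not with "ddee".
        printf("C(bbcc | aabb) = %lu bytes\n", c_given("bbcc", "aabb"));
        printf("C(bbcc | ddee) = %lu bytes\n", c_given("bbcc", "ddee"));
        return 0;
    }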

  • @cyanmargh
    @cyanmargh 10 months ago +2

    Calculating Kolmogorov complexity is not only NP-hard, it's actually uncomputable (and even Wikipedia contains a proof of it)

  • @thegreywun
    @thegreywun 10 months ago +1

    you could get a large speed-up by caching the compressed samples - as you will only need to re-compress the combination of test+sample inside the loop, instead of all three each time

  • @pastenml
    @pastenml 10 months ago +4

    CPU cycles have never been utilized more efficiently. Take that Sam Altman!

  • @CalebWetherell
    @CalebWetherell 10 months ago

    To get an intuition for setting k, I would look at the sorted NCD table for the first n records. If the klassifier is good, the klasses of the examples at the top of the sorted NCD table should often match the test example's klass.

  • @Raspredval1337
    @Raspredval1337 10 months ago +4

    44:30 stdio already reads in chunks. So it kinda reads a chunk and then copies it into the user's read buffer. And yes, it is inefficient; you could probably do better with raw 'read' syscalls

  • @byt3blitz
    @byt3blitz 10 months ago +2

    24:21 One of your machine learning videos a while back inspired me to recreate a Kaggle contest. Not sure if it's considered a contest; it's more of a preliminary challenge before moving to some of the more difficult ones. It's the Titanic prediction model. I solved it using Python and immediately thought, wow, that was stupid. All the functions I used were completely abstracted, and I had no clue what they were doing under the hood. As a means of learning, and an excuse to not use Python, I decided to cause immense pain to myself and reimplement the model in C! I spent a decent amount of time writing a CSV parser for the training data. The way the data was structured really pissed me off: nested double quotes and all sorts of crazy shit they packed in a given string lmfao. I ended up using a stack-based approach, and it worked perfectly. I'm still working on building the rest of the model; it's been challenging, but I'm learning a lot. It's videos like these that inspire me to dig a bit deeper.

  • @JonnyDeRico
    @JonnyDeRico 10 months ago +5

    wow, this is cool shit. I heard geohot talking about how AI is basically compression :D And here it is clearly demonstrated.

  • @annorome
    @annorome 10 months ago +1

    I love your little rant about mathematicians giving stuff overly technical, complex names to make things seem complex. Mathematicians are really good at abstracting things to easier problems for themselves, and really good at making them as complex as possible for others.

  • @ecosta
    @ecosta 9 months ago

    I love how I get exposed to different papers by watching Tsoding. It feels like I'm back at college (but without the stress of tests, etc).

  • @tomasruzicka9835
    @tomasruzicka9835 10 months ago

    It's interesting. I started getting really good at learning fast recently. I think it has to do with my ability to find a good enough "compression algorithm" and then compress the topic I want to learn. And yes, I first have to understand the concept, but I didn't really have a problem with that at school. What I did have a problem with was remembering a semester of detailed literature verbatim (times the amount of courses I attended that semester) so I could do well on exams. (Stupid, really.) But in the end the process became: parse and index the issue for common concepts, generalize the differences as much as possible, remember the general concept once, remember the few variables to bend the concept to my will. Which, if you think about it, is like a compression algorithm: find common subsets, encode them by "keywords", remember the common keywords once (like a dictionary), remember the variables (the keys into the dictionary)

  • @SuperFinGuy
    @SuperFinGuy 9 months ago

    It seems that data compression basically mimics cognitive abstraction, or how we form concepts. We find the similarities among objects in order to classify them. For example, we call something a chair because it meets certain criteria.

  • @Blubb3rbub
    @Blubb3rbub 10 months ago +2

    You could maybe also speed up the calculation of C(ab) by feeding the test sample into a deflate stream and then making a full copy of the state of that deflate stream before feeding in the other sample. Then copy that backup again to get back to the state where the test sample was already deflated.
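
    zlib exposes exactly this as deflateCopy(). A sketch with invented strings; it also serves the caching idea mentioned earlier, since the test sample is compressed only once, outside the loop:

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int main(void)
    {
        const char *test = "the cat sat on the mat ";
        const char *samples[] = { "the cat sat on the hat", "stocks fell sharply" };
        unsigned char out[1 << 16];

        // Feed the test sample into a deflate stream once...
        z_stream base = {0};
        deflateInit(&base, Z_BEST_COMPRESSION);
        base.next_in = (Bytef *)test;
        base.avail_in = strlen(test);
        base.next_out = out;
        base.avail_out = sizeof(out);
        deflate(&base, Z_NO_FLUSH);

        for (int i = 0; i < 2; i++) {
            // ...then restore a snapshot of that state for every sample,
            // instead of recompressing the test sample each time.
            z_stream s = {0};
            deflateCopy(&s, &base);
            s.next_in = (Bytef *)samples[i];
            s.avail_in = strlen(samples[i]);
            deflate(&s, Z_FINISH);
            printf("C(test + sample %d) ~ %lu bytes\n", i, s.total_out);
            deflateEnd(&s);
        }
        deflateEnd(&base);
        return 0;
    }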

  • @nephew_tom
    @nephew_tom 10 months ago +4

    Interestingly, when you say "state of the art AI language models and shit like that", the YouTube auto-generated subtitles do not show the "shit"...
    The YouTube guys implemented a shit-collector... 😅

  • @varshneydevansh
    @varshneydevansh 10 months ago +6

    I was already smiling just from reading the title 💪🏻😁

  • @punitkaushik7862
    @punitkaushik7862 10 months ago +1

    I am working on a DL project, and it's 4 am, and I am fucking dying because I don't have enough compute on Kaggle to run the models properly... and YouTube recommends me this. These recommendation systems are getting out of hand, man.

  • @blackasthesky
    @blackasthesky 9 months ago

    What a hilarious paper, love it!

  • @buny0n
    @buny0n 10 months ago

    @00:12:31 their code also contains syntax errors (a missing ')' after every use of ...`encoded()`).

  • @titaniumtomato7247
    @titaniumtomato7247 10 months ago

    I guess with k = 2 it's hard to show the "confidence" of the system, but that would have been a cool addition so that you can evaluate the output better

  • @ossianarn9175
    @ossianarn9175 10 months ago

    You could try implementing the MDS algorithm using the NCD in order to plot the points in a lower dimension. Then you could use k-means and see if the data can be clustered correctly.

  • @gianlucalocri
    @gianlucalocri 9 months ago

    Hi! New subscriber here!
    You have one of the freaking best channels on YT. Your content is beyond priceless!
    I'm speechless.
    BTW you really need a simple piece of software to display onscreen the number of times you say "right" 😂 (no offense intended!)
    Greetings from Italy!

  • @diegoestrada35
    @diegoestrada35 10 months ago

    11:23 this reminds me of a video I saw of George Hotz, where he said exactly that. I think it was called "the days that changed my life" or something

  • @RandomGeometryDashStuff
    @RandomGeometryDashStuff 10 months ago +1

    22:46 but github also needs to display stuff, not just load into memory

  • @MystycCheez
    @MystycCheez 2 months ago

    Are names a language's version of implementing (an approximation of) Kolmogorov complexity?

  • @dnkreative
    @dnkreative 10 months ago

    It works because NNs are actually lossy compressors, and with enough coefficients they approach a lossless compression state. And zip is just a lossless compressor. That seems obvious.

    • @dnkreative
      @dnkreative 10 months ago

      Btw, this has nothing to do with "intelligence" or "understanding"

    • @DelgardAlven
      @DelgardAlven 10 months ago

      @@dnkreative true, and still this raw gzip demo by zoze captures the sense kinda more precisely than the average person in the chat LOL

  • @ivanjermakov
    @ivanjermakov 10 months ago

    By the same analogy, the brain is an interpreter that uses body's sensory systems as IO. Fascinating!

  • @nil0bject
    @nil0bject 10 months ago

    PASCAL was a good competitor to C. I'm surprised people like us still exist. GG

  • @apppples
    @apppples 10 months ago

    try implementing HNSW for this to speed up your classification? It'd probably be a lot of fun to implement

  • @dromedda6810
    @dromedda6810 10 months ago +2

    loved watching this live

  • @Pyronimous
    @Pyronimous 10 months ago +1

    That's freaking brilliant! I wonder if this would work for handwritten character recognition, like with the MNIST dataset or similar?

    • @hendrikd2113
      @hendrikd2113 10 months ago

      Bro, just google "mnist by gzip".
      The first result will tell you that this is possible, but 78% accuracy is kinda mid.

    • @stephaneduhamel7706
      @stephaneduhamel7706 10 months ago +1

      Maybe using PNG instead of gzip?

  • @apppples
    @apppples 10 months ago

    another idea... could you preprocess the training data and find the centroid of each class (maybe 10 or so samples per class) that maximizes the distance from other classes while minimizing the distance to other elements of its own class, then classify on this reduced set? though at that point maybe you just get the most common words per class... idk

  • @nyyakko
    @nyyakko 10 months ago +1

    @ 27:30 and this is why I'm always using -Werror in my projects 😅😅

  • @ruroruro
    @ruroruro 10 months ago

    7:58 it's not NP-hard, it requires solving the halting problem

  • @ianknowles
    @ianknowles 10 months ago

    Holy phek... here's me being an idiot using embeddings. This makes total sense of how it works. I'm only 10 mins in and I totally want to stop and try to implement it myself. I suspect for simple query-and-passage problems this would be useless, but for large bodies of text it would work a charm.
    Gives me an idea to use a thesaurus and/or terminologies to create domain-specific querying without an AI model

  • @ThankYouESM
    @ThankYouESM 6 months ago

    I might have proof of being the first person to find this solution, which was just to see if kNN can work with zip compression. By the way... I am not at all a mathematician, and I also might be the first to have replaced the sigmoid with the ReLU.

  • @SteveRowe
    @SteveRowe 10 months ago

    I'd try to reproduce their results with their data.

  • @dinoscheidt
    @dinoscheidt 10 months ago +1

    12:10 😀 me as an AI researcher: 100% we needed to make it worse to squeeze out space to fit into the max page length. We shall apologize to all engineers ❤️

  • @hoefkensj
    @hoefkensj 10 months ago

    The difference between compression and AI is the same as the difference between ALAC/FLAC/WavPack and AAC/MP3/Vorbis: the first is obligated to reproduce the source material exactly, while the second tries to create a new thing that is perceived as being like the original. I recently had a discussion with someone about something related: I claimed that Nvidia in essence proclaimed that they hit the ceiling of their abilities by pushing all-in on AI. If you take a step back and look at the process, they switched from increasing GPU power to improve image quality and frame rate, to pre-rendering the games at ultra quality on their server farm and shipping the frames the cards can't generate (due to lack of power) to the customer in a highly compressed form disguised as AI. If they just gave you hard disks full of complete pre-rendered frames, and your PC only had to pick the right frames to match the actions you make in the game, you'd be calling it a disgrace... Because the AI can only do what it was trained to do, and similarly a compressed pre-render can only approximate its original source, the x-thousand AI accelerators are pretty useless on their own for anything other than image restoration, and definitely for anything requiring accuracy... I'm thinking of doing Fourier transforms massively in parallel (something gfx cards should be good at, and they are, but they won't get better than they are now), or protein folding, possibly even emulating synchronous (logic) systems or simulating old console hardware (similar to what FPGAs are used for today, e.g. a modern Game Boy)

  • @puzzlinggamedev
    @puzzlinggamedev 9 months ago

    Today I used this trick in the puzzle game I'm making :)

  • @matthias916
    @matthias916 10 months ago +1

    Is it possible to show the probabilities of the classes? Or does it not calculate that?

  • @thenwhoami
    @thenwhoami 7 months ago

    "The internet feels increasingly smaller and smaller and smaller..." yes :(

  • @SegevKlein
    @SegevKlein 9 months ago

    this video was awesome

  • @ToniMorton
    @ToniMorton 10 months ago

    so theoretically information compression is the emergence of awareness/understanding?

  • @guptadagger896
    @guptadagger896 10 months ago

    11:04 Is this true? Does the ideal compression of a piece of text, for example, necessarily have the same entropy as the smallest program (in a specific language) that generates that text?

    • @guptadagger896
      @guptadagger896 10 months ago

      I think you'd have to take the minimum over all programming languages, at the very least. Whether they all collapse to the same value is the question

  • @RandomGeometryDashStuff
    @RandomGeometryDashStuff 10 months ago

    43:21 c is still easier for dynamic linking than pascal

  • @owlmostdead9492
    @owlmostdead9492 10 months ago

    23:00 thank JS for that

  • @RandomGeometryDashStuff
    @RandomGeometryDashStuff 10 months ago +3

    05:00 can NCD(x,y)!=NCD(y,x) happen?

    • @igricRBX
      @igricRBX 10 months ago

      By definition, no.

    • @leiko7986
      @leiko7986 10 months ago +2

      Well in practice yes lmao

    • @DylanNelsonSA
      @DylanNelsonSA 10 months ago +3

      @@igricRBX What do you mean? The definition has C(xy) in it, which depends on the order of xy. I would find it surprising if the smallest program that outputs xy always has the same length as the smallest program that outputs yx. And even if we forget about Kolmogorov complexity and just focus on gzip, I did an experiment where I concatenated two latex files from university assignments that happened to be on my laptop, and used gzip to compress the result and looked at the size, and then did the same but with the concatenation in the other order, and the order did matter. (Although the difference was only about 10 bytes)

    • @DylanNelsonSA
      @DylanNelsonSA 10 months ago +2

      We could make the definition symmetric by replacing C(xy) with 1/2 (C(xy) + C(yx)). I don't know if it really matters. I think knn is already sensitive to the order in which you add new items.

    • @anon_y_mousse
      @anon_y_mousse 10 months ago +2

      @@DylanNelsonSA That's because of how it processes the data. The sliding window is only so large and if data repeats in certain ways it can compress better or worse. You can even generate data that can bork the processing time.

  • @ignaciovittorinifennema9791
    @ignaciovittorinifennema9791 10 months ago

    I wonder if you could make a basic spam/phishing filter using this

  • @platin2148
    @platin2148 10 months ago

    Hmm should take a look at Oodle i guess?

  • @RandomGeometryDashStuff
    @RandomGeometryDashStuff 10 months ago

    40:53 fread and ferror don't set errno, so it would log an incorrect error message

  • @rigen97
    @rigen97 10 months ago

    Tangent re: text files. It's so, so, so stupid how "hard" it is for text editors to open files in the megabytes range. It's a minuscule fraction of a fraction of the RAM available in our modern systems, yet Notepad++ for example lags so hard when you feed it files in the megabytes, while ed just chads through it.

  • @marcelbricman
    @marcelbricman 9 months ago

    It doesn't have anything to do with complexity. As you also realized, it's about the common parts of A and B. Huffman coding will do the heavy lifting here, basically looking for common prefixes and reducing their occurrence.

  • @mythacker4365
    @mythacker4365 10 months ago +4

    What da thumbnail it is 😂😂😂

  • @h65119
    @h65119 10 months ago

    Maybe it won't matter, but at 45:24 you were measuring the performance of the whole application (reading the file into memory and then counting the lines), so maybe counting the lines takes more time than opening the file; hence you cannot decide whether reading the whole file into memory is faster than reading it in chunks of 32k bytes. Unless maybe the OS is also reading the file into memory using 32k-byte chunks? Or maybe the allocator allocates pages of 32k bytes at a time and handles the page faults? IDK, but what I'm saying is maybe the measurement was flawed. Maybe I'm wrong; I am just curious, so please correct me if so.

  • @flamendless
    @flamendless 10 months ago

    Parsing complex CSV is easier than signing up on Kaggle, it seems 😂

  • @sukina5066
    @sukina5066 10 months ago

    42:50 true, I used to be a fanboy and it hurts but it's true

  • @marusdod3685
    @marusdod3685 10 months ago

    couldn't you define a macro to automatically create an array type of a struct, something like
    #define vector(T)\
    typedef struct {\
    T* data;\
    uint16_t length,capacity;\
    } T ## s
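
    For what it's worth, here is a compiling version of that sketch with a usage example. One caveat: the token paste T##s only works for single-word type names, so e.g. vector(char *) would not compile.

    #include <stdint.h>

    #define vector(T) \
        typedef struct { \
            T *data; \
            uint16_t length, capacity; \
        } T##s

    vector(int);    // defines the type `ints`
    vector(double); // defines the type `doubles`

    int main(void)
    {
        ints xs = {0};    // empty vector of int
        doubles ys = {0}; // empty vector of double
        (void)xs;
        (void)ys;
        return 0;
    }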

  • @jgfjfgjfhjf
    @jgfjfgjfhjf 6 months ago

    science is basically reverse debugging: find out why something works

  • @SBcode-s3n
    @SBcode-s3n 10 months ago

    Hello
    I want to learn OpenCV, but I have an i5 CPU, 8 GB RAM and an Intel HD 520 GPU. Is it enough?

  • @wzqdhr
    @wzqdhr 10 months ago

    Wait for the day someone comes up with an AI idea called "hash is all you need"; it's going to be more awesome than crypto

  • @helloworld9018
    @helloworld9018 10 months ago

    Can someone clarify which main fields of math and CS this man is interested in?

  • @xd_metrix
    @xd_metrix 10 months ago +2

    Just a little reminder: if you divide the video into 120 parts, the division is useless, because it doesn't tell a person anything. Perhaps it would be better to combine some parts and generalize

  • @PedroPabloCalvoMorcillo
    @PedroPabloCalvoMorcillo 10 months ago

    You're fucking awesome.

  • @elrichjr301
    @elrichjr301 10 months ago

    Here's a crackpot idea: AI is lossy compression for text?

  • @hubstrangers3450
    @hubstrangers3450 9 months ago

    Thank you....

  • @koktszfung
    @koktszfung 10 months ago +1

    Nice thumbnail

  • @sergarrick
    @sergarrick 10 months ago

    I can't help but notice that all these papers that detract from modern llms use models that pre-date gpt-3. Like, no one ever claimed OPT or text-davinci were exciting, revolutionary, or game changing. It wasn't until gpt-3/4, llama, mistral, yi, etc that llms started to feel magic. The ouroboros paper was the same thing. Finetuning OPT with OPT outputs makes for a dumber model. Finetuning llama with llama outputs makes a far smarter model. Pretty sure that's how wizardLM was trained. Likewise, I'm not surprised Gzip is smarter than BERT. Who is out here claiming BERT is the future?

  • @terakiei
    @terakiei 9 months ago

    Yeah, but it takes much more time and processing

  • @blackhaze3856
    @blackhaze3856 10 months ago +1

    I thought gzip used Huffman. Or most probably I just didn't understand shit xD

  • @paulpruenster5884
    @paulpruenster5884 10 months ago

    3:06 FACTS hahaha bothers me so much sometimes

    • @grivza
      @grivza 10 months ago

      I mean, "k-nearest-neighbors classifier" is a pretty descriptive name; what else would you want them to call it?

  • @muhammedyasinhanyasar539
    @muhammedyasinhanyasar539 10 months ago

    I was watching your videos, and you said in one of your previous streams that "the less you know about a subject, the better you do than the others; you get the job done quickly." And then I just ran across this video: th-cam.com/video/mBcBoGhFndY/w-d-xo.html -> "I built an image search engine". You know what? You were right. Did he really build an image search engine? Nah, he didn't. But at the same time he did, without purely understanding the billions and zibillions of things under the hood. You were right...

  • @luandkg
    @luandkg 10 months ago

    Omg, Tsoding, are you learning Portuguese? "cinco" 👉🏼👈🏼

  • @Alguem387
    @Alguem387 10 months ago +1

    Try Odin

  • @jjonojj
    @jjonojj 10 months ago +1

    good video

  • @RandomGeometryDashStuff
    @RandomGeometryDashStuff 10 months ago

    24:59 Shell 76.5%

  • @MemeConnoisseur
    @MemeConnoisseur 10 months ago

    Woah

  • @tommyshelby6277
    @tommyshelby6277 10 months ago +2

    IF THERE IS A DISCORD SERVER, HOW TF CAN'T I FIND IT ANYWHERE!!!

    • @ade5324
      @ade5324 10 months ago +1

      www.twitch.tv/tsoding/about

    • @H3llsHero
      @H3llsHero 10 months ago +4

      You ok bro? You can find an invite link on his twitch page

    • @dieSpinnt
      @dieSpinnt 10 months ago +3

      @@H3llsHero Do you think this is a good idea with all that shouting?
      Hehehehe:)

  • @ADRENELINEDUDE
    @ADRENELINEDUDE 10 months ago

    Any time someone says "AI", it's like saying a generic term ("peaches are better than fiber"); "AI" can literally mean anything. I can make a garbage AI such that, yes, "zip will work better"; however, a very well crafted ML compression system would then be better.

  • @v2ike6udik
    @v2ike6udik 10 months ago

    1:21:54 !

  • @avi7278
    @avi7278 10 months ago

    No, no it isn't....dumb question.

  • @RandomGeometryDashStuff
    @RandomGeometryDashStuff 10 months ago

    58:15 how does multiplying by 2 handle compressed data bigger than the uncompressed data?
    hello.txt is 6 bytes and hello.txt.gz is 36 bytes, and 6*2 = 12, and 12 < 36