I compared Two PDF Libraries. C one was faster than Rust one.

  • Published on 1 Aug 2024
  • Chapters:
    - 00:00:00 - Intro
    - 00:00:40 - Recap
    - 00:03:15 - Demo
    - 00:04:59 - lopdf
    - 00:07:43 - Installing lopdf
    - 00:10:12 - Studying lopdf docs
    - 00:12:04 - Trying out lopdf
    - 00:16:18 - Trying out lopdf on bigger PDF
    - 00:20:31 - Performance concerns with lopdf
    - 00:21:19 - Extracting text with lopdf
    - 00:25:26 - Realizing a huge mistake
    - 00:26:28 - pdftotext
    - 00:29:21 - poppler
    - 00:32:06 - Studying poppler docs
    - 00:32:34 - Trying out poppler
    - 00:38:59 - Poppler is just faster
    - 00:40:40 - Extracting text with poppler
    - 00:47:06 - Price of poppler
    - 00:50:07 - Integrating poppler into indexing
    - 00:59:21 - Testing the indexing of PDF papers
    - 01:11:09 - On Reading Books
    - 01:14:16 - On The Price of Dependencies
    - 01:18:05 - Wrapping up
    - 01:18:53 - QnA
    - 01:21:53 - Outro
    - 01:22:07 - Smooch
    References:
    - Source Code: github.com/tsoding/seroost
    - Rust PDF library: github.com/J-F-Liu/lopdf
    - C PDF library: poppler.freedesktop.org/
    - PDF papers:
      - arxiv.org/abs/1706.03762
      - people.freebsd.org/~lstewart/...
      - arxiv.org/pdf/2303.12712.pdf
  • Science & Technology

Comments • 137

  • @----__---
    @----__--- 1 year ago +58

    the best way to learn something is to say something wrong and let people correct you
    - Sun Tzu

    • @SENTRY456123
      @SENTRY456123 1 year ago +3

      "and I'd say he knows a little more about programming than you do, pal, because he invented it!"

    • @AlFredo-sx2yy
      @AlFredo-sx2yy 11 months ago

      @SENTRY456123 "and then he perfected it so that no living man could best him in the Ring of Honor!"

  • @michaelmueller9635
    @michaelmueller9635 1 year ago +50

    Well, these dependencies ...escalated quickly.

  • @inferrna
    @inferrna 1 year ago +30

    If you look through the lopdf issues, you'll find a lot of complaints about speed. There is also a solution: switching to the nom parser.

    • @TsodingDaily
      @TsodingDaily 1 year ago +19

      Yes, it was already brought to my attention! Thanks! I'll look into that a bit later.

    • @aperson4051
      @aperson4051 1 year ago +2

      @TsodingDaily lopdf is indeed frustratingly slow, but enabling nom and multithreading (rayon) makes it at least bearable for mid-sized PDFs
      ```
      lopdf = { version = "=0.29.0", default-features = false, features = ["nom_parser", "rayon"] }
      ```

  • @user-mq6qp2bm2r
    @user-mq6qp2bm2r 1 year ago +11

    The combinatorial operation at 46:12 will work if done like this:
    if let Some(content) = document.get_page(n).as_ref().and_then(|p| p.get_text()) {
        println!("Content: {}", content);
    } else {
        println!("Content not found!");
    }

  • @rodelias9378
    @rodelias9378 1 year ago

    Such a great stream! Thx

  • @juanmacias5922
    @juanmacias5922 1 year ago +10

    I stopped trying to read books, I just have pdf readers I use at 2x speed, and a browser extension that'll read what I highlight lol

  • @rt1517
    @rt1517 6 months ago +1

    "You query the knowledge database which is LLM."
    An LLM is not a knowledge database.
    It does not contain information like wikipedia does.
    Or let's say that the information is encoded in such a weird way ("stored" as parameter values) that it is no longer consistent.
    So ChatGPT can easily provide wrong information or forget information.
    LLMs are text completion tools, and they can make up information to complete your prompt.

  • @-aexc-
    @-aexc- 15 days ago

    The thing about reading is that you get exposed to ideas that you aren't initially seeking or don't know you want to ask about, which I think is a big advantage of reading. If you're going in with only one question, yeah, being able to query it is cool.

  • @thefirstDeathclaw
    @thefirstDeathclaw 1 year ago +4

    1:13:00 There is actually a model that does that, chatpdf

  • @4445hassan
    @4445hassan 1 year ago +38

    Creating performant and efficient programs mainly comes from knowing what you are doing. Rust chooses its defaults very differently than C does. C is pretty much performance by default, while Rust chooses safety, predictability and something along the lines of theoretical correctness. For instance, files being unbuffered in Rust is certainly stupid from a performance standpoint, but a `BufReader`, which can be used with any reader, is a lot more powerful, and having that while files were also buffered would be weird. What I appreciate most about Rust is that my knowledge grants me speed and improves my software, but even with baseline, naively implemented software I don't need to worry about memory unsafety, and only a little about memory leaks, which frees my mind to focus on speed and efficiency.
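
    A minimal sketch of the `BufReader` point above, using only the standard library. The file name and the `count_lines` helper are made up for illustration and are not taken from the stream's code.

    ```rust
    use std::fs::{self, File};
    use std::io::{self, BufRead, BufReader};
    use std::path::Path;

    /// Counts the lines of a file by reading it through a `BufReader`.
    /// `File` does no buffering of its own, so the wrapper lets many small
    /// reads be served from an in-memory buffer.
    fn count_lines(path: &Path) -> io::Result<usize> {
        let file = File::open(path)?; // unbuffered handle
        let reader = BufReader::new(file); // accepts any `Read` implementor
        let mut count = 0;
        for line in reader.lines() {
            line?; // propagate I/O errors instead of ignoring them
            count += 1;
        }
        Ok(count)
    }

    fn main() -> io::Result<()> {
        // Hypothetical scratch file, created only for this demo.
        let path = std::env::temp_dir().join("bufreader_demo.txt");
        fs::write(&path, "first\nsecond\nthird\n")?;
        println!("lines: {}", count_lines(&path)?);
        fs::remove_file(&path)?;
        Ok(())
    }
    ```

    Because `BufReader::new` takes any `Read` implementor, the same wrapper also works for stdin or a socket, which is presumably the flexibility the comment has in mind.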

    • @temper8281
      @temper8281 1 year ago +2

      Issue with Rust is that you can write a program that maintains certain invariants that the Rust compiler has no idea about. That means you will always have to jump through more hoops than the equivalent C program. Any C programmer worth their salt is going to write programs and organise logic that has as many invariants as possible. So the equivalent C program is likely always going to end up faster.

    • @deadmarshal
      @deadmarshal 1 year ago +9

      Rust is a joke

    • @temper8281
      @temper8281 1 year ago +3

      @deadmarshal most likely tbh. Might just be a passing fad but who knows. Personally I don't think it really offers much

    • @weirdo911aw
      @weirdo911aw 1 year ago +1

      But you ARE worrying about memory unsafety. The borrow checker will always make you jump through hoops. Development speed is 10 times slower on Rust than C

    • @SomeRandomPiggo
      @SomeRandomPiggo 1 year ago +3

      @temper8281 There has been a trend of new languages recently, they're all fads.

  • @ilyasabi8920
    @ilyasabi8920 1 year ago +10

    Today's stream highlight 40:00 that's the spirit. 🤣

  • @Neuer_Alias_erstellen
    @Neuer_Alias_erstellen 1 year ago +2

    there might be a fork of poppler with just the necessary stuff for pdf to txt

  • @inferrna
    @inferrna 1 year ago +8

    In fact, GPLv2 allows you to link dynamically.

    • @haakonness
      @haakonness 1 year ago +1

      Isn't that what lgpl is for?

    • @inferrna
      @inferrna 1 year ago +2

      @haakonness Yes, indeed. But if you just redistribute the source code with an instruction like "be sure you have poppler installed / place poppler.dll here", no Stallman follower can accuse you of violating the GPL.

  • @deadviny
    @deadviny 10 months ago

    What distro is he using?

  • @gertrudessampaio8689
    @gertrudessampaio8689 3 months ago

    Lopdf is great now, the performance has improved so much 😊.

  • @TECHN01200
    @TECHN01200 1 year ago +1

    And this is how we get left pad...

  • @lorenzo42p
    @lorenzo42p 1 year ago +3

    gpl requires you share the source code if you distribute binaries. if you don't distribute then you don't have to share. I'm a fan of agpl, which considers using software over a network as distribution.

  • @oj0024
    @oj0024 1 year ago +6

    I've read that poppler is quite slow and that v8's pdf parser and mupdf are faster.

    • @oj0024
      @oj0024 1 year ago +4

      There is a mupdf Rust wrapper that has almost no dependencies (from what I can tell).

    • @user-dh8oi2mk4f
      @user-dh8oi2mk4f 1 year ago +1

      Why does v8, a javascript engine, have a pdf parser?

    • @oj0024
      @oj0024 1 year ago +1

      @user-dh8oi2mk4f because it can open pdf files

    • @user-dh8oi2mk4f
      @user-dh8oi2mk4f 1 year ago +1

      @oj0024 There is no pdf parser in the v8 codebase, and v8 cannot open pdf files.

    • @oj0024
      @oj0024 1 year ago

      @user-dh8oi2mk4f ah, I confused v8 with Chrome itself. (and didn't read your comment correctly)

  • @FDominicus
    @FDominicus 1 year ago +5

    Maybe you like to try the V language?

  • @Anhar001
    @Anhar001 1 year ago +4

    My man just use Pandoc and then xargs that shit in bash

  • @blastygamez
    @blastygamez 1 year ago +12

    I am speed. ~the c programming language

    • @blastygamez
      @blastygamez 1 year ago +8

      I love c :D

    • @anon_y_mousse
      @anon_y_mousse 1 year ago +4

      @blastygamez It is the best programming language.

    • @blastygamez
      @blastygamez 1 year ago +3

      @anon_y_mousse facts

    • @rian0xFFF
      @rian0xFFF 1 year ago +2

      @anon_y_mousse The mother language

  • @barbiefan3874
    @barbiefan3874 8 months ago

    "I wonder what is a pom_parser... I don't really know what it is... anyway"
    "It's taking THAT much time to process this pdf?"
    I mean... nom_parser is 6 times faster on my machine. But still it shouldn't be that long even for pom_parser.
    It takes ~300ms with pom and ~55ms with nom for me on that huge 155-page PDF
    Loving your content btw, thank you.
    (Are you in russia? If so I could donate you...)

  • @msakg
    @msakg 1 year ago +4

    If rustc is intelligent enough, dependencies shouldn't be a problem, at least if what worries you is the size of the compiled binary after link-time optimization. I totally agree with you, though: in today's world no one respects optimized software, because we have 512-core CPUs etc., so as you said every app, especially on the web, is bloated!

  • @drygordspellweaver8761
    @drygordspellweaver8761 1 year ago +8

    C is pretty damn legendary. If it just had Pascal strings / array length with less cryptic type declarations then it would be God tier.
    Btw- I am in your discord but have no possible way to verify so can't say anything there. Can you fix it please.

    • @cassandradawn780
      @cassandradawn780 1 year ago +1

      i think you have to support him in some way to gain that privilege

  • @hashtag9990
    @hashtag9990 1 year ago +4

    *FAQ*
    *why does this library take too much time to process?*
    because you aren't using your time for anything better

  • @lokthar6314
    @lokthar6314 1 year ago +1

    I watched a vid of you from 2 years ago and you were fat, what happened and how did you lose weight so fast? you're looking good my man

  • @oabragh
    @oabragh 1 year ago

    Did u even optimize the rust build

    • @ussuratoncachi
      @ussuratoncachi 1 year ago +2

      Yeah, he did. And lopdf was astonishingly slow

  • @cjt9150
    @cjt9150 1 year ago +1

    last release of lopdf is Mar 7, 2019 - discontinued development.
    last release of poppler is Apr 2, 2023 - continuously developed.
    why u compare these two. eventually lopdf may be ......

  • @sug_madic7683
    @sug_madic7683 1 year ago +3

    why is there a 7.0 GiB porn folder on the taskbar at the bottom of your screen at 2:57

  • @royendgelsilberie7028
    @royendgelsilberie7028 1 year ago +6

    lopdf takes 1.60s and 0.30s (release mode) on a 369-page PDF
    for me, doing:
    cargo add lopdf
    use lopdf::Document;
    fn main() {
        let document = Document::load("page.pdf").unwrap();
        let b = document.get_pages().len();
        eprintln!("Pages here : {:?}", b);
    }

  • @orgs804
    @orgs804 1 year ago +1

    could you send a discord invite link to your server?

  • @name._..-.
    @name._..-. 1 year ago

    53:00 comments can solve this problem...

  • @movization
    @movization 1 year ago +9

    bloody pdf files 🙈

  • @taxerap8498
    @taxerap8498 1 year ago +2

    39:46😂

  • @siriusleto3758
    @siriusleto3758 1 year ago +3

    And now? Faster with exploits or slower with no exploits?

  • @energy-tunes
    @energy-tunes 1 year ago +2

    In other news water is wet

  • @anon_y_mousse
    @anon_y_mousse 1 year ago +20

    Personally, I'm sick of people telling me Rust is faster than C. It's really not, it's equally as fast at best, but compilation speed matters to me as I'll be the one writing and compiling the code. So my experience is that it's actually significantly slower because it takes me longer to iterate through any changes I may need to make to my code. As far as libraries go, I've got enough to choose from in C that Rust doesn't win there either. Aside from being able to find them easily enough, I've written quite a few too.
    Of course, I would love for more of these Rustaceans to prove what they think they know and write it in C, if they can. Oh you know how a data structure works so you don't need to write it from scratch? Prove it. After all, if the programming language doesn't really matter and they're just using it because it makes using what they know easier, then it shouldn't matter if you're using Rust or C as long as you know what libraries to use when writing your code.
    In case my point there wasn't clear, the programming language does matter and Rust makes it harder to write code fast and do work. My side point there is that having everything in a standard library doesn't matter because there's enough open source code that has had decades of debugging behind it and will work better than what most could write from scratch.

    • @postmodernist1848
      @postmodernist1848 1 year ago +10

      I always hope that the language I learn won't just be "yet another systems programming language" and C isn't the best we've got after 50 years, but it often still is the best language for many use cases. Although some Rust features are quite attractive, Tsoding is right saying that to write fast Rust you really need to know what's going on, so the entire abstraction really is kinda pointless, but at least it's SAFE.
      P. S. I wonder how he would handle unicode without Rust char/String type (in C, for example, it's quite painful)

    • @vvarhand3985
      @vvarhand3985 1 year ago +2

      What about memory and type safety? Those are some of the biggest advantages of Rust compared to other langs. Not being sarcastic, I’m interested in your take on this

    • @postmodernist1848
      @postmodernist1848 1 year ago +2

      More static analysis is always good, I guess, and that safety makes Rust feel a lot more high level than it actually is, but sometimes Rust makes things more difficult. As one Rust YouTube preacher said, "in Rust, simple is possible and complex is easy".
      By the way, the original commenter's point about the standard library is debatable, because many people find convenience in there being only one library for a thing (unlike JavaScript, for example, where you have 10 XML parsers available; I know it's not that good from a free-software-alternatives point of view, but it does offer convenience)

    • @dnels493
      @dnels493 1 year ago +9

      I don't know C. I come to Rust from the opposite side of the stack in a quest to continue going lower and lower. From my perspective, Rust provides a fairly easy to use, high level language that is incredibly fast (in comparison to dynamic languages) with an incredible type system. Rust isn't just a low level language competing with C, it's also a high level language competing with xyz high level languages but with huge advantages.

    • @bartpelle3460
      @bartpelle3460 1 year ago

      You're presenting yourself no better than the people you antagonize, what the fuck is your problem lmao

  • @workflowinmind
    @workflowinmind 1 year ago

    I don't understand why you don't just read the files as utf8? no library needed for what you do with it

    • @Tigregalis
      @Tigregalis 1 year ago +10

      pdf is a binary format.
      i hate pdf

  • @PyczekFromPoland
    @PyczekFromPoland 1 year ago +1

    Hello

  • @billgrant7262
    @billgrant7262 1 year ago +4

    16 gb of ram? That's pretty low nowadays.

    • @TsodingDaily
      @TsodingDaily 1 year ago +18

      My laptop has 8 gb PepeHands

    • @movization
      @movization 1 year ago +19

      @TsodingDaily 8 gb ought to be enough for anybody

    • @antronixful
      @antronixful 1 year ago +8

      owned, destroyed and humbled with facts and real life examples

    • @antronixful
      @antronixful 1 year ago +1

      @ThomasVWorm based

    • @antronixful
      @antronixful 1 year ago +2

      @ThomasVWorm i don't know if this is a layer of sarcasm i'm too stupid to comprehend, but if not, look at the definition on Urban Dictionary, since you cannot share any links in YouTube comments

  • @i007c
    @i007c 1 year ago

    the more that i look at rust the more i dont want to learn it

  • @antronixful
    @antronixful 1 year ago +10

    i really hate rust because of two things:
    1.- it treats you as an idiot in the most annoying passive aggressive way
    2.- the community

    • @maelstrom254
      @maelstrom254 1 year ago +1

      What does the community do wrong?

    • @antronixful
      @antronixful 1 year ago +7

      @maelstrom254 "our language is the best, come poor developer and be happy for the first time in your professional (and possibly, your overall) life. Make every other language look bad on the internet and bump every rust related stuff, be part of the crates/cargo mentality shift and forget about painful package management. Be rust, be rustacean, be happy"

    • @antronixful
      @antronixful 1 year ago +7

      @maelstrom254 i almost forgot: "security driven development"

    • @maelstrom254
      @maelstrom254 1 year ago +5

      @antronixful agree, it’s a cult

    • @antronixful
      @antronixful 1 year ago +8

      @maelstrom254 finally: "did you know that rust is the most beloved language since the year 400BC and that aristotle used it to prove the existence of prime numbers, because C was too insecure?"

  • @DungVu-di7dz
    @DungVu-di7dz 9 months ago

    I have tried running it many times, the truth is that C++ is 2 times faster than rust

  • @BraxtonMeyer
    @BraxtonMeyer 1 year ago +1

    This title is bad. It's not really a surprise; what would be a surprise is if the Rust one breaks before the C one. Rust is not meant to beat C in performance, because that's fucking idiotic; it's meant to be on par with C performance with additional safety.

    • @johnvayianos9077
      @johnvayianos9077 1 year ago +5

      Who hurt you bro lmao.

    • @xavhow
      @xavhow 1 year ago +3

      Same thoughts.
      I don’t have to be c expert to know Rust can’t beat c in performance most of the time!
      But I do believe safe Rust is safer than c most of the time.
      If Rust can be on par with c in performance sometimes, it’s a win for Rust, IMO.

  • @serkan_bayram
    @serkan_bayram 1 year ago +1

    just write your own pdf library 🫠

  • @rogo7330
    @rogo7330 1 year ago +3

    wtf is `Ok(())`?

    • @henrylang699
      @henrylang699 1 year ago +9

      in rust there's a type called Result which can represent either an Ok value or an Err value, and it is used for things that could fail. in this case, the Ok value is the empty type (), which means that it succeeded but returned nothing

    • @rogo7330
      @rogo7330 1 year ago +2

      @henrylang699 ok, now I understand even less. Can it be checked outside of this block/function/whatever or something?

    • @Tigregalis
      @Tigregalis 1 year ago +5

      i'm extremely sympathetic to your confusion. the first time i saw Ok(()) at the end of all these functions, i had the exact same response. it took me a few attempts to understand what was going on.
      there are three things going on here.
      1. an "empty tuple" () is the return type for a function that returns nothing, it is a tuple with 0 elements. it's practically the same as void. for comparison a "two tuple" would look something like (a, b). normally, for convenience reasons, you don't need to write () in the function return type definition or the body of the function (although technically you can, there's just no need for it).
      2. rust has enums which are tagged unions, i.e. the variants hold types of their own. in this case there is the Result enum, which has two variants: Ok(T), and Err(E). the T and E are generics, so they could each be any concrete types. Ok represents the success result, and T is that successful result. Err represents the failure result, and E is that failed result. Crucially, it can only be one or the other, and to get either T or E you have to check which one it is.
      3. the last expression in any block without a semicolon is an implicit return, i.e. you don't have to write return x; so in this case you are implicitly returning Ok(())
      putting these together, you have a function that doesn't return anything i.e. it returns nothing, or in other words it returns the value (). but it could fail, so it returns a result which contains either () or some error. at the end of the function body (after your code successfully runs) you have to return a Result, because the function's return type expects a Result, so you must return the value Ok(()), unfortunately because it's not a plain () you can't simply omit the value. as it's at the end of the function body, you can do an implicit return by just writing `Ok(())` instead of `return Ok(());`
      you should read the rust book and it will all become clear. rust is too different from other languages in often very complex ways that you can't get by without using rust-specific learning material, but probably as a result of that, the rust-specific learning material is very very good.
      edit: and yes, the caller has to handle the result of this function. with Ok(()) there's no interesting data in the success case, only the fact of success itself, but the error may be interesting
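
      The three points above can be sketched as follows. `EmptyInput` and `check_not_empty` are hypothetical names invented for this illustration.

      ```rust
      // Hypothetical error type, made up for this example.
      #[derive(Debug, PartialEq)]
      struct EmptyInput;

      // The success type is the empty tuple `()`: on success there is no data
      // to hand back, only the fact that nothing went wrong.
      fn check_not_empty(text: &str) -> Result<(), EmptyInput> {
          if text.is_empty() {
              return Err(EmptyInput); // explicit early return of the failure variant
          }
          Ok(()) // last expression without a semicolon: the implicit return
      }

      fn main() {
          // The caller has to inspect the Result to learn which variant it got.
          for input in ["hello", ""] {
              match check_not_empty(input) {
                  Ok(()) => println!("{:?} accepted", input),
                  Err(e) => println!("{:?} rejected: {:?}", input, e),
              }
          }
      }
      ```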

    • @rogo7330
      @rogo7330 1 year ago +1

      @Tigregalis so, basically it's just syntactic sugar (which I despise a little). Idk, maybe it will add something to this language if it allows the compiler to make more specific predictions, but right now I think it's just another C++.

    • @Tigregalis
      @Tigregalis 1 year ago +9

      @rogo7330 I consider syntactic sugar to be a way to express an idea with less code (less noise). In a way this is that, but I don't think so. It allows you to write different code, or express different ideas (higher level ideas) because you're not fighting with the language or writing so much boilerplate (noise).
      Here's the important thing about error handling in Rust. There are no exceptions, and there is no try/catch. You as the programmer must handle the error, or the program will not compile. At every step of the program that could fail, you have to check, does it succeed or does it fail? If it succeeded, do X. if it failed, do Y, even if that means to tell it to panic (safely crash). in many other languages you don't have this information: anywhere down the callstack might throw an exception, and what type is that exception? Where do you place your try/catch? In rust you don't have this problem. in a way, Rust is pessimistic about code that could fail, while just about every other language is optimistic about code that could fail. In Rust, if the code fails, it's OK, you've definitely already handled it. In other languages, who knows?
      You should read the book and try the language. It's really easy to set up. Form your own opinion, absolutely, but form it after trying the thing first. don't dismiss it as "just another C++" (it definitely isn't, believe me) without trying it.
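
      A small sketch of handling a `Result` at each step, assuming a made-up task of parsing a page count from text. `parse_page_count` and `page_count_or_zero` are hypothetical helpers. One nuance: the value inside a `Result` cannot be reached without going through `match`, `?`, `unwrap` or similar, while simply ignoring a returned `Result` produces a compiler warning rather than an error.

      ```rust
      use std::num::ParseIntError;

      // Passes any failure up to the caller with `?` instead of handling it here.
      fn parse_page_count(raw: &str) -> Result<u32, ParseIntError> {
          let n = raw.trim().parse::<u32>()?;
          Ok(n)
      }

      // Handles both variants explicitly and falls back to 0 pages on failure.
      fn page_count_or_zero(raw: &str) -> u32 {
          match parse_page_count(raw) {
              Ok(n) => n,
              Err(_) => 0,
          }
      }

      fn main() {
          println!("{}", page_count_or_zero(" 155 "));
          println!("{}", page_count_or_zero("many"));
      }
      ```

      The error type appears in the function signature, so the caller can see that a failure is possible and what kind it is without reading the function body.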