Binary data exercise: how to tell if a file is a jpeg?

  • Published on Dec 23, 2024

Comments • 77

  • @litlkaiser
    @litlkaiser a year ago +39

    The read method in ruby has an option for encoding, e.g. f.read(3, encoding="UTF-8")

    • @JacobSorber
      @JacobSorber a year ago +7

      Good point. Thanks!

    • @Raugturi
      @Raugturi a year ago +4

      Maybe it's a little pedantic of me, but I'd just let Ruby read it in as binary and in my compare do "\xFF\xD8\xFF".encode('ASCII-8BIT'). We make the test bytes match what we expect rather than mutating what we're reading in to see if the mutated value matches something else. And I think there's an alias of 'BINARY' so "\xFF\xD8\xFF".encode('BINARY') should also work and is maybe more explicit about what we want.

    • @redcrafterlppa303
      @redcrafterlppa303 a year ago +1

      @Raugturi Can't Ruby just read the bytes as a number array and let you create a number array with 3 hex constants and compare them?
      Would be weird if any language couldn't do numbers and arrays.

    • @shivisuper91
      @shivisuper91 a year ago

      @Raugturi was about to write the exact same comment 😅

  • @NonTwinBrothers
    @NonTwinBrothers a year ago +15

    Definitely would be interested if more file format videos are to come :)

  • @suncrafterspielt9479
    @suncrafterspielt9479 a year ago +13

    Let's have a deep dive into the metadata

  • @donaldmickunas8552
    @donaldmickunas8552 a year ago +11

    Interesting. I’m taking a python course currently. This will make an interesting exercise in python. Thanks! 😀

    • @CoolKoon
      @CoolKoon a year ago

      I'm pretty sure that this is not an issue in Python though as long as binary mode is being used.

  • @unperrier5998
    @unperrier5998 a year ago +5

    Python 3 doesn't have this problem: you can read the file directly in binary mode and get "bytes" objects.

    • @JacobSorber
      @JacobSorber a year ago

      Thanks. Good point. I've definitely still seen the bytes object cause some newcomer confusion when the programmer doesn't understand why a bytes object and a string object (with a string of bytes) are not the same thing. It feels like a different flavor of the same problem - but maybe/probably a more sensible solution.

  • @bolter841009
    @bolter841009 a year ago +3

    Thank you for the example! Really a good intro to magic bytes 🙂 Maybe another nice video would be a simple jfif/exif low-level parser - just the big stuff - display the block type and size, could be useful for integrity check 🙂 most out-of-the-box libraries “fix” minor errors or ignore erroneous information when possible.

  • @unperrier5998
    @unperrier5998 a year ago +1

    At 16:50 isn't it better to encode the UTF-8 string into 8-bit ASCII instead?
    How can you be sure that the 3 bytes read from the file form valid UTF-8?

    • @JacobSorber
      @JacobSorber a year ago

      Yeah, probably. I just picked one, since I just needed the encodings to match. But, yes, if I were doing anything else with the strings, forcing both to ASCII-8BIT would have been better.

    • @rdwells
      @rdwells a year ago

      @JacobSorber In this particular case, since you know that the string you're looking for is not a valid UTF-8 string, I think you'd definitely be better off using the ASCII-8BIT encoding. Otherwise, you're depending on the language to do the right thing when comparing invalid UTF-8 strings. If it is doing a byte-by-byte comparison you're safe, but if "equal" means "represents the same sequence of Unicode code points", I would think all bets are off if the strings being compared are not valid UTF-8 strings. (It is possible for two UTF-8 strings that are not identical byte-wise to be equal in terms of what they encode.)

  • @rexjuggler19
    @rexjuggler19 a year ago +1

    ❤ This is one of my favorites! A real world example. Unix/Linux has "magic" built in, and I've edited /etc/magic by hand to add file types, but I do data migrations, so this in-depth how-to is very helpful for building custom tools myself regardless of platform. It may be a bit off the channel topic, but it might be useful for you to do a video on encoding - ASCII, UTF, ISO, EBCDIC - and maybe even add byte transmission like 7-bit even parity, 8-bit no parity... stuff like that. Working at the atomic level of data is very helpful to develop a better understanding of computers in general.

  • @cprkan
    @cprkan a year ago +4

    I love your vids... they helped me understand pointers :)

    • @JacobSorber
      @JacobSorber a year ago

      Yessss! Glad they helped.

  • @XESCoolX
    @XESCoolX a year ago

    10:17 I know it’s not super important, but I think it would make more sense to have this print “No!” instead of returning an error.
    Because if it’s reading less, i.e. if the file is smaller than 3 bytes, then we know that it’s not a JPG.
    I’m not sure, though, whether it’s possible for fread to fail while the file is still a JPG. Would it be safe to assume this doesn’t happen if fread does not return 3?

    • @redcrafterlppa303
      @redcrafterlppa303 a year ago

      You can check whether a file I/O error occurred or you hit end of file by calling and checking
      feof() // end of file
      and
      ferror() // error
      So the number returned really just serves to confirm the success case.
      To fully answer your question: you could retry by rerunning fread or completely reopening the file in case fread did not fail with EOF (which would confirm it is not a JPG, as it is smaller than 3 bytes).
      But I'm not sure whether putting in this much effort would be worth it in most cases.
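
A minimal sketch of the feof()/ferror() idea from this reply, written in C. The enum values and the function name `check_jpeg` are hypothetical names made up for illustration; they are not taken from the video. It assumes the stream is already open in binary mode and positioned at the start of the file.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical result codes for this sketch; they are not from the video. */
enum jpeg_check { JPEG_YES, JPEG_NO, JPEG_IO_ERROR };

/* Reads the first three bytes of an open stream and compares them with the
   JPEG signature FF D8 FF. */
enum jpeg_check check_jpeg(FILE *fp)
{
    static const unsigned char magic[3] = { 0xFF, 0xD8, 0xFF };
    unsigned char buf[sizeof magic];

    size_t n = fread(buf, 1, sizeof buf, fp);
    if (n < sizeof buf) {
        /* Short read: ferror() reports a real I/O failure. */
        if (ferror(fp))
            return JPEG_IO_ERROR;
        /* Otherwise we ran into end of file (feof() is set), so the file
           has fewer than three bytes and cannot be a JPEG. */
        return JPEG_NO;
    }
    return memcmp(buf, magic, sizeof magic) == 0 ? JPEG_YES : JPEG_NO;
}
```

With this split, a file shorter than three bytes simply answers "No!", and only a genuine read failure is treated as an error.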

  • @udeeptabhadra1578
    @udeeptabhadra1578 several months ago

    This was informative and helpful. Thanks!

  • @reverse_shell
    @reverse_shell a year ago +1

    Yes for metadata and more file disassembly please!

  • @pseudopseudo3679
    @pseudopseudo3679 a year ago +1

    a video on reading/writing bitmap data would be cool :)

  • @eddaly4160
    @eddaly4160 a year ago +1

    Great video as usual. What is the proper way to exit a program? "exit(EXIT_FAILURE)" or "return EXIT_FAILURE"... or use EXIT_SUCCESS... I'm not sure when to use "return" or "exit" to end the program.

    • @JacobSorber
      @JacobSorber a year ago +1

      With most C runtimes, main is called like this from some other libC startup routine - result = main(argc, argv); exit(result); So, returning from main and calling exit are essentially doing the same thing. I suppose that calling exit from main rather than returning might be slightly more efficient (avoiding a function call return), but it's not likely to make any difference (especially once compiler optimizations get involved).

  • @gerdsfargen6687
    @gerdsfargen6687 a year ago +5

    Could you check if it is not a real jpg but may have some hidden data within the file?

    • @gerdsfargen6687
      @gerdsfargen6687 a year ago

      I know probably not to expect any reply from Jacob 😢

    • @JacobSorber
      @JacobSorber a year ago

      Not sure I understand the question. Are you asking if the magic numbers could check out and it not follow the JPG format - maybe holding other data? Yes, it could.

    • @gerdsfargen6687
      @gerdsfargen6687 a year ago

      @JacobSorber oh wow, hi Jacob!
      I suppose yes, I'm asking if those magic numbers could pose as a jpg yet maybe carry some hidden data within those very magic numbers.
      I want to thank you for your reply, and will take it on board when checking an example of this case out.
      Cheers!

  • @zrodger2296
    @zrodger2296 a year ago +2

    "Strings are strings; bytes are bytes." That's the way it should always be! Good video.

  • @samuelmartin7319
    @samuelmartin7319 a year ago

    I would love more videos on this topic!

  • @thomas_m3092
    @thomas_m3092 a year ago

    Why does the C version work? FF and D8 are outside the range of a char, which is normally signed.
    Shouldn't the compiler warn about it?
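
One way to look at this question, as a hedged sketch in C: if the comparison goes through memcmp() (an assumption about the video's code, which is not shown here), the signedness of char does not matter, because memcmp() interprets both buffers as unsigned char. The helper names `byte_is_ff` and `starts_with` are hypothetical.

```c
#include <string.h>

/* Plain char may be signed. A char holding the byte FF then promotes to -1,
   so writing `c == 0xFF` directly could be false on such a platform.
   Converting to unsigned char first compares the raw byte value. */
int byte_is_ff(char c)
{
    return (unsigned char)c == 0xFF;
}

/* memcmp() compares the bytes as unsigned char values, so the result does
   not depend on whether plain char is signed or unsigned. */
int starts_with(const char *data, const char *magic, size_t n)
{
    return memcmp(data, magic, n) == 0;
}
```

Declaring the buffers as unsigned char in the first place sidesteps the question entirely.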

  • @MECHANISMUS
    @MECHANISMUS a year ago

    Helpful presentation!
    Why put env in the shebang line instead of the interpreter alone?
    It seems it would be prettier with the explorer collapsed or squashed.

  • @hashi856
    @hashi856 a year ago

    How are you using that not equal sign?

  • @k1defjoel397
    @k1defjoel397 a year ago

    I'm on a roll here binge watching your videos. Super impressed. Hoping you can help me connect the dots of my limited understanding. I was under the impression C++ has better string handling capabilities and figured C++ would be your go-to. In this case, you chose C for its simplicity and its lack of encoding confusion. Does that mean C++ adds complexity compared to C, such that you'd need to concern yourself with encoding choices?

    • @JacobSorber
      @JacobSorber a year ago +1

      Thanks! I'm glad they're helping out. C++ doesn't do any automatic string encoding stuff, and you could do essentially the same thing that I did here using C++ (fopen, fread are available in both). But, yes, C++ does provide some nice string-related tools (nowhere near what you get from python or ruby), but when I'm working with binary data and individual bytes, I often don't see an advantage to using C++, unless I need object oriented stuff elsewhere in the code. In this case, I didn't.

  • @robertstrickland9722
    @robertstrickland9722 a year ago

    I would love to see some videos on binary file manipulation, especially something like writing your own encryption program.

  • @billmoo
    @billmoo a year ago

    Any reason as to why you never closed the FILE * ?

  • @justcurious1940
    @justcurious1940 a year ago +1

    Thanks Jacob, great video. Let's play with sockets and threads, I think it will be more fun.

  • @JonnyRobbie
    @JonnyRobbie a year ago +1

    Why not #define the MAGIC_NUM_BYTES? I know the low level difference between defines and declarations, but what is the high level practical difference?

    • @JacobSorber
      @JacobSorber a year ago

      That would work, as well. I have a video about this somewhere in the list. Making it a variable allows the compiler to help you with type checking and some forms of error detection. #define might in some cases have performance advantages (I don't think it would in this case). For this example, both are viable options.

  • @atabac
    @atabac a year ago

    What IDE is that? It looks like it's using some syntactic sugar coating: it shows an unequal symbol instead of != . Small thing, but kind of annoying that it hides the real code, hehe.

  • @andrewporter1868
    @andrewporter1868 a year ago

    wide char argv tutorial for Win32 wen (wmain and wWinMain)?

  • @tommcboatface1908
    @tommcboatface1908 a year ago

    Great video!

  • @raul_ribeiro_bonifacio
    @raul_ribeiro_bonifacio a year ago

    Just found out about this channel. Nice content!

    • @JacobSorber
      @JacobSorber a year ago +1

      Thanks, and welcome!

  • @coolbrotherf127
    @coolbrotherf127 a year ago

    That's pretty cool. I've never tried to do this before.

  • @anon_y_mousse
    @anon_y_mousse a year ago +1

    That's completely bonkers to me that Ruby would have that issue. Especially considering that 7-bit ASCII maps exactly into UTF-8. At this point I should probably be keeping a list of reasons to never learn Ruby. As an aside, being a Linux user I would never bother to write such a program because `file` exists and I could just check with it, however, a good example you might want to make a future video for is serializing data structures. I prefer to use a text based method, such as TOML, for simple structures, but when it's complex I use the binary approach.

    • @redcrafterlppa303
      @redcrafterlppa303 a year ago

      I wrote an image grouping tool that groups PNG files into one file (PNG has its own set of magic numbers), and I thought it would be problematic if my separator identifiers randomly appeared in the binary data (unlikely but possible). My theory (not confirmed) was that the magic numbers of the file format likely won't appear in the binary data.
      So I literally packed the images byte for byte one after another and separated them by splitting at the magic start. It works and it's as efficient as possible.

  • @andreisoceanu4320
    @andreisoceanu4320 a year ago +1

    I love this way of questioning everything. Next: How to tell if a JPEG is a file.

    • @JacobSorber
      @JacobSorber a year ago

      You would have to define what you mean by "JPEG". 🤔

    • @andreisoceanu4320
      @andreisoceanu4320 a year ago

      @JacobSorber that is easy, just #define JPEG

  • @DelgardAlven
    @DelgardAlven a year ago

    Feels like home when somebody types things in C. Things are exactly what they are 99% of the time, and the remaining 1% is just rare machines’ unique memory conventions, nothing else.

  • @flippert0
    @flippert0 a year ago

    6:51 nice jab at Windows

  • @beardlyinteresting
    @beardlyinteresting a year ago +1

    Because I usually do a
    typedef char byte;
    or more specifically, if I know I'm only targeting C99 and later,
    typedef uint8_t byte;
    just to be explicit when writing code. When I saw a char array of size 3 my brain went "that's not big enough, what about the null terminator". It took my brain a second to go, "no, it's just a byte array of fixed length, so we know how long it is" 😅
    Also, only using printf for error messages instead of fprintf(stderr, ...); bugged me way more than it should lol. Good explanation though, and I love the Ruby example because I've run into encoding issues before with scripting languages, never had an encoding problem with C. So it's certainly something to keep in mind.
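
The habits mentioned in this comment can be sketched together in C. This is only an illustration under assumptions: `MAGIC_LEN`, `JPEG_MAGIC`, `read_magic`, and `is_jpeg_magic` are hypothetical names, and the video's own code may be organized differently.

```c
#include <stdint.h>
#include <stdio.h>

typedef uint8_t byte;   /* needs C99 or later for <stdint.h> */

#define MAGIC_LEN 3     /* hypothetical name for this sketch */

/* A fixed-length byte array, not a C string: no null terminator is stored,
   so the length has to travel with it. */
static const byte JPEG_MAGIC[MAGIC_LEN] = { 0xFF, 0xD8, 0xFF };

/* Reads up to MAGIC_LEN bytes from path into buf and returns how many were
   read. A failure to open the file is reported on stderr, not stdout. */
size_t read_magic(const char *path, byte buf[MAGIC_LEN])
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL) {
        fprintf(stderr, "could not open %s\n", path);
        return 0;
    }
    size_t n = fread(buf, 1, MAGIC_LEN, fp);
    fclose(fp);         /* release the stream as soon as we are done */
    return n;
}

/* Returns 1 when the n bytes in buf are exactly the JPEG signature. */
int is_jpeg_magic(const byte *buf, size_t n)
{
    if (n != MAGIC_LEN)
        return 0;
    for (size_t i = 0; i < MAGIC_LEN; i++) {
        if (buf[i] != JPEG_MAGIC[i])
            return 0;
    }
    return 1;
}
```

Keeping diagnostics on stderr means the program's actual answer on stdout stays clean when it is piped into another tool.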

    • @redcrafterlppa303
      @redcrafterlppa303 a year ago +1

      Wait until you try to use Windows and C/C++ and you get
      char
      signed char
      unsigned char
      wchar_t
      char8_t
      char16_t
      char32_t
      All being different character types used in various functions in Windows.
      PS:
      And yes, I looked it up: char is defined as being neither unsigned char nor signed char in the Microsoft compiler.

    • @beardlyinteresting
      @beardlyinteresting a year ago

      @redcrafterlppa303 Yeah, that's why I use Linux 🤣

  • @torrenttv7567
    @torrenttv7567 a year ago

    Please make a next video of socket server handle part 4 - event driven

  • @ramadhanafif
    @ramadhanafif a year ago +1

    Yes, this encoding bs really frustrates me when I'm doing byte- or bit-level manipulation in Python. Things that are seemingly so easy in C can get tangled due to mismatching data types.

    • @redcrafterlppa303
      @redcrafterlppa303 a year ago

      The worst thing is you don't directly see the datatypes because python is stupid (sorry not sorry)

  • @ercntreras
    @ercntreras a year ago

    Nice!

  • @sajolsajol8393
    @sajolsajol8393 a year ago

    Sir, Suggest me a book where I can learn about these things...

  • @ForeverNils
    @ForeverNils a year ago +2

    Did you forget to close the file?

    • @pcuser80
      @pcuser80 a year ago

      Yep, I see no fclose(fp);

    • @RobertFletcherOBE
      @RobertFletcherOBE a year ago

      when a process exits its resources are released.

    • @pcuser80
      @pcuser80 a year ago

      @RobertFletcherOBE Yep, I know that.
      But it is better to close/free everything.
      For a short-lived program you don't have to use free. For programs that run all the time you must use free.

    • @JacobSorber
      @JacobSorber a year ago +1

      Yes, I did. Sorry about that.

    • @ForeverNils
      @ForeverNils a year ago

      @RobertFletcherOBE OK, but it would be nice to manually release a resource when it's not needed any more.

  • @greg4367
    @greg4367 a year ago

    Greetings from San Francisco. Let's get to the important stuff: Where can I get one of those malloc() T-shirts? It is not in your Merch section.

    • @JacobSorber
      @JacobSorber a year ago

      I'm glad you have your priorities straight. It should be on there now.

    • @greg4367
      @greg4367 a year ago

      @JacobSorber I'll get my order in now, thanks.

  • @yooyo3d
    @yooyo3d a year ago

    You should do proper jpg chunks reading. Testing the first 3 bytes is not enough. People can learn more about formats, and common practices how to work with binary data and maybe how to save and load their own binary files

  • @minhajsixbyte
    @minhajsixbyte a year ago

    basically an oddly specific version of "file" command/program

  • @soniablanche5672
    @soniablanche5672 11 months ago

    I don't think reading 3 random bytes as UTF-8 is a good idea, not all 3 bytes sequences are valid UTF-8 so your program might crash, give error or return a garbage string. I think it's better to convert the string you are comparing with to ASCII / bytes / char / whatever it's called in your language

  • @__hannibaal__
    @__hannibaal__ a year ago

    The programming world has a lot of scary words, like CPU, DPU, GPU, JPEG, ZIP, MPI, LLVM…; they kept me far away at first, but when I dove in deep to understand them, I found they only exist to distinguish between principles. That reminds me of the scary-sounding mathematical theorems that push people away from studying mathematics.

  • @brockdaniel8845
    @brockdaniel8845 5 months ago

    Pretty nice

  • @DhruvTrivedi
    @DhruvTrivedi a year ago

    For those using GCC, you have to initialize magicNumber using malloc:
    char *magicNumber = malloc(MAGIC_NUM_BYTES * sizeof(char));
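
A possible alternative to malloc, sketched in C under an assumption: that the GCC complaint comes from sizing an initialized array with a const-qualified variable. In C such a variable is not an integer constant expression, so the array becomes a variable-length array, and those cannot take an initializer. An enum constant (or a #define) avoids that. The function name `has_jpeg_magic` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* An enum constant is an integer constant expression, unlike a
   const-qualified variable, so it can size an initialized array. */
enum { MAGIC_NUM_BYTES = 3 };

/* Returns 1 when the first MAGIC_NUM_BYTES bytes of data are FF D8 FF. */
int has_jpeg_magic(const unsigned char *data, size_t len)
{
    const unsigned char magic[MAGIC_NUM_BYTES] = { 0xFF, 0xD8, 0xFF };
    unsigned char magicNumber[MAGIC_NUM_BYTES] = { 0 };  /* stack buffer, no malloc */

    if (len < (size_t)MAGIC_NUM_BYTES)
        return 0;
    memcpy(magicNumber, data, MAGIC_NUM_BYTES);
    return memcmp(magicNumber, magic, MAGIC_NUM_BYTES) == 0;
}
```

A three-byte buffer on the stack also removes the need for a matching free().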