This INCREDIBLE trick will speed up your data processes.

  • Published Jan 24, 2025

Comments • 396

  • @miaandgingerthememebunnyme3397
    @miaandgingerthememebunnyme3397 2 ปีที่แล้ว +354

    First post! That’s my husband; he knows about data…

    • @LuisRomaUSA
      @LuisRomaUSA 2 ปีที่แล้ว +12

      He knows a lot of good stuff about data 😁. He's the first non-introductory Python TH-camr I have found so far 🎉

    • @venvanman
      @venvanman 2 ปีที่แล้ว +9

      aww this is cute

    • @sketch1625
      @sketch1625 2 ปีที่แล้ว +12

      Guess he's really in a "pickle" now.

    • @foobarAlgorithm
      @foobarAlgorithm 2 ปีที่แล้ว +3

      Awww now you guys need a The DataCouple channel if you both do data science! Love your content

    • @Arpan_Gupta
      @Arpan_Gupta 2 ปีที่แล้ว

      Nice work Mr. ROB

  • @mschuer100
    @mschuer100 2 ปีที่แล้ว +29

    As always, awesome video...a real eye-opener on the most efficient file formats. I have only used pickle with compression, but will now investigate feather and parquet. Thanks for putting this together for all of us.

    • @robmulla
      @robmulla  2 ปีที่แล้ว +4

      Glad it was helpful! I use parquet all the time now and will never go back.

  • @lashlarue7924
    @lashlarue7924 ปีที่แล้ว +2

    You are my new favorite TH-camr, Sir. I'm learning more from you than anyone else, by a country mile!

  • @holgerbirne1845
    @holgerbirne1845 2 ปีที่แล้ว +36

    Very good video :). One note: pickle files can be compressed. If you compress them, they become much smaller, but reading and writing become slower. Overall parquet and feather are still much better.

    • @robmulla
      @robmulla  2 ปีที่แล้ว +5

      Good point! There are many ways to save/compress that I probably didn't cover. Thanks for watching the video.

  • @DainiusKirsnauskas
    @DainiusKirsnauskas 9 หลายเดือนก่อน +2

    Man, I thought this video was clickbait, but it was awesome. Thank you!

  • @Jvinniec
    @Jvinniec 2 ปีที่แล้ว +27

    One really cool feature of .read_parquet() is that it passes additional parameters through to whichever backend you're using. For example, the filters parameter in pyarrow allows you to filter data at read time, potentially making it even faster:
    df = pd.read_parquet("myfile.parquet", filters=[('col_name', '>', 0)])

    • @robmulla
      @robmulla  2 ปีที่แล้ว +9

      Whoa. That is really cool. I didn't realize you could do that. I've used Athena, which allows you to query parquet files using standard SQL, and it's really nice.

    • @PhilcoCup
      @PhilcoCup 2 ปีที่แล้ว +4

      Athena is amazing when backed by parquet files. I've used it to read through 600M+ records stored in those parquets with ease.

    • @incremental_failure
      @incremental_failure ปีที่แล้ว +3

      That's the real use case for parquet. Feather doesn't have this.
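
    For anyone who wants to try the filters tip above, a minimal sketch (the file name, column names, and threshold are made up; filters uses pyarrow's list-of-tuples form, and columns prunes columns at read time):

    import pandas as pd

    # Read only two columns, and only rows where "amount" exceeds a threshold.
    # Both the column pruning and the row filtering happen while reading the file.
    df = pd.read_parquet(
        "myfile.parquet",
        columns=["order_id", "amount"],
        filters=[("amount", ">", 1000)],
    )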

  • @nancyzhang6790
    @nancyzhang6790 2 ปีที่แล้ว +1

    I saw people mention feather on Kaggle sometimes, but had no clue what they were talking about. Finally, I got answers to many questions in my mind. Thank you!

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Yes. Feather and parquet formats are awesome for when you want to quickly read and write data to disk. Glad the video helped you learn!

  • @walterpark8824
    @walterpark8824 2 ปีที่แล้ว +4

    Exactly what I needed to know, and to the point. Thanks.
    As Einstein said, 'Everything should be as simple as possible, and no simpler!'

    • @robmulla
      @robmulla  ปีที่แล้ว

      That’s a great quote. Glad you found this helpful.

  • @nascentnaga
    @nascentnaga 10 หลายเดือนก่อน

    as someone moving into data science this is such a great explainer! thank you

  • @KirowOnet
    @KirowOnet ปีที่แล้ว +1

    This was the first video from the channel that randomly appeared in my feed. I clicked, I watched - I liked and subscribed :D. This video planted a seed in my mind, and some others inspired me to try. So a few days later I had a playground environment running in Docker. I'm not a data scientist, but the tips and tricks from your videos could be useful for any developer. I used to write code to check some datasets, but with pandas and a Jupyter notebook it's way faster. Thank you for sharing your experience!

    • @robmulla
      @robmulla  ปีที่แล้ว

      Wow, I really appreciate this feedback. Glad you found it helpful and got some code working yourself. Share with friends and keep an eye out for new videos dropping soon!

  • @bothuman-n4b
    @bothuman-n4b ปีที่แล้ว

    Hi Rob. I'm from Argentina, you are the best!!!

  • @rrestituti
    @rrestituti 2 ปีที่แล้ว +1

    Amazing! Got one new member. Thanks, Rob! 😉

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Glad you liked it. Thanks for commenting!

  • @wonderland860
    @wonderland860 2 ปีที่แล้ว +2

    This video greatly helped me. I didn't know there were so many ways to dump a DataFrame. I then ran a further test and found the compression option plays a big role:
    df.to_pickle(FILE_NAME, compression='xz') -> 288M
    df.to_pickle(FILE_NAME, compression='bz2') -> 322M
    df.to_pickle(FILE_NAME, compression='gzip') -> 346M
    df.to_pickle(FILE_NAME, compression='zip') -> 348M
    df.to_pickle(FILE_NAME, compression='infer') -> 679M # default compression
    df.to_parquet(FILE_NAME, compression='brotli') -> 334M
    df.to_parquet(FILE_NAME, compression='gzip') -> 355M
    df.to_parquet(FILE_NAME, compression='snappy') -> 423M # default compression
    df.to_feather(FILE_NAME) -> 500M

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Nice findings! Thanks for sharing. Funny that compressing parquet still works. I didn't know that.

    • @DeathorGloryNow
      @DeathorGloryNow ปีที่แล้ว

      @@robmulla Actually if you check the docs parquet files are snappy compressed by default. You have to explicitly say `compression=None` to not compress it.
      Snappy is the default because it adds very little time to read/write with modest compression and low CPU usage while still maintaining the very nice columnar properties (as you showed in the video). It is also the default for Spark.
      Other compressions like gzip get it smaller but at a much more significant cost to speed. I'm not sure this is still the case but in the past they also broke some of the nice properties because it is compressing the entire object.
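
    To make the defaults discussed in this thread explicit, a small sketch (the DataFrame and file names are placeholders):

    import pandas as pd

    df = pd.DataFrame({"a": range(1000), "b": ["x"] * 1000})  # tiny stand-in frame

    df.to_parquet("data_snappy.parquet")                    # snappy is the default codec
    df.to_parquet("data_plain.parquet", compression=None)   # explicitly uncompressed
    df.to_parquet("data_gzip.parquet", compression="gzip")  # smaller file, slower read/write
    df.to_pickle("data.pkl.xz", compression="xz")           # pickle compresses only when asked;
    # to_pickle's default compression="infer" keys off the extension, so a plain .pkl path is written uncompressed.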

  • @69k_gold
    @69k_gold 11 หลายเดือนก่อน

    I looked this up, and it's a pretty cool format. I kind of guessed it could be a column-based storage strategy when you said we can efficiently read only select columns, and after I looked it up and found that to be true, it felt very exciting.
    Anyways, hats off to Google's engineers for thinking out of the box on this; there is a lot you can do just by storing data as column-lines rather than row-lines. Of course, the trade-off is that it's very expensive to modify column-wise data, so this is more useful for static datasets that require multi-dimensional analysis.
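
    A quick way to see the columnar benefit described above, sketched with made-up file and column names:

    import pandas as pd

    # Parquet's column-oriented layout means only the requested columns are read from disk...
    cols = pd.read_parquet("events.parquet", columns=["user_id", "amount"])

    # ...whereas a row-oriented csv still has to be scanned line by line, even with usecols.
    cols_csv = pd.read_csv("events.csv", usecols=["user_id", "amount"])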

  • @Banefane
    @Banefane 11 หลายเดือนก่อน

    Very clear, very structured, and the details are intuitive to understand!

  • @jamesafh99
    @jamesafh99 5 หลายเดือนก่อน

    Thanks a lot! Loved the video and it helped me with what I needed ❣️ Keep going with these videos. They're really worth it 🔥

  • @gustavoadolfosanchezhurtad1412
    @gustavoadolfosanchezhurtad1412 ปีที่แล้ว +2

    Very clear and insightful explanation, thanks Rob, keep it up!

    • @robmulla
      @robmulla  ปีที่แล้ว

      Thanks Gustavo. I’ll try my best.

  • @bendirval3612
    @bendirval3612 2 ปีที่แล้ว +21

    A major design objective of feather is that it can also be read by R. If you are doing pandas-type data science work, this is a significant advantage.

    • @robmulla
      @robmulla  2 ปีที่แล้ว +10

      Great point. The R package called "arrow" can read in both parquet and feather files.

  • @humbertoluzoliveira
    @humbertoluzoliveira ปีที่แล้ว +1

    Hey Guy, nice job. Congratulations! Thanks for video.

    • @robmulla
      @robmulla  ปีที่แล้ว

      Thanks for watching Humberto.

  • @beethovennine
    @beethovennine 2 ปีที่แล้ว +1

    Rob, you did it again...keep'em coming, good job!

  • @chrisdsouza1885
    @chrisdsouza1885 4 หลายเดือนก่อน

    These file saving methods are really useful 😊

  • @spontinimalky
    @spontinimalky 2 หลายเดือนก่อน

    You explain very clearly. Thank you.

  • @cristianmendozamaldonado3241
    @cristianmendozamaldonado3241 ปีที่แล้ว +1

    I really love it man, thank you. You saved a life

    • @robmulla
      @robmulla  ปีที่แล้ว

      Thanks! Maybe not saved a life, but saved a few minutes of compute time!

  • @FilippoGronchi
    @FilippoGronchi 2 ปีที่แล้ว +2

    Excellent as usual Rob...very very useful indeed

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Thank you sir!

  • @reasonableguy6706
    @reasonableguy6706 2 ปีที่แล้ว +17

    Rob, You're a natural communicator (or you worked really hard at acquiring that skill) - most effective. I follow you on twitch and I'm currently going through your youtube content to come up to speed. Thanks for sharing your time and experience. Have you thought about aggregating your content into a book as a companion to your content - something like "Data Analysis Using Python/Pandas - No BS, Just Good Stuff" ?

    • @robmulla
      @robmulla  2 ปีที่แล้ว +6

      Hey. Thanks for the kind words. I’ve never considered myself a naturally good communicator and it’s a skill I’m still working on, but I appreciate your positive feedback. The book idea is great, maybe sometime in the future….

  • @niflungv1098
    @niflungv1098 2 ปีที่แล้ว +4

    This is good to know. I'm going into web development now, so I usually use JSON format for serialization... I'm still new to python so I didn't know about parquet and feather. Thank you!

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Glad you found it helpful. Share it with anyone else you think would benefit!

  • @arielspalter7425
    @arielspalter7425 2 ปีที่แล้ว +1

    Excellent tutorial Rob. Subscribed!

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Thanks so much for the feedback. Thanks for subscribing!

  • @pablodelucchi353
    @pablodelucchi353 2 ปีที่แล้ว +2

    Thanks Rob, awesome information! Learning a lot from your channel. Keep it up!

    • @robmulla
      @robmulla  2 ปีที่แล้ว +1

      Isn’t learning fun?! Thanks for watching.

  • @truthgaming2296
    @truthgaming2296 ปีที่แล้ว

    thanks rob, it helps a beginner like me a lot to realize there are weaknesses in the csv format 😉

  • @jeremynicoletti9060
    @jeremynicoletti9060 6 หลายเดือนก่อน

    Thanks for sharing; I think I'll start using feather and parquet for some of my data needs.

  • @ozymet
    @ozymet ปีที่แล้ว +1

    Very good stuff. The essence of information.

    • @robmulla
      @robmulla  ปีที่แล้ว

      Glad you liked it!

    • @ozymet
      @ozymet ปีที่แล้ว

      @@robmulla I saw a few more videos, insta sub. Thank you. Glad to have found you.

  • @javiercmh
    @javiercmh ปีที่แล้ว +1

    Very engaging and clear. Thanks!

    • @robmulla
      @robmulla  ปีที่แล้ว +1

      Thanks for watching. 🙌

  • @anoopbhagat13
    @anoopbhagat13 2 ปีที่แล้ว +1

    learnt something new today. Thank you Rob for this useful & informative video.

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Learn something new every day and before long you will be teaching others!

  • @marcosoliveira8731
    @marcosoliveira8731 2 ปีที่แล้ว +2

    I've learned a great deal with this video. Thank you!

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Thanks so much for the feedback. Glad you learned from it!

  • @Schmelon
    @Schmelon ปีที่แล้ว +1

    interesting to learn about the existence of parquet and feather files. nothing beats csv for portability and ease of use

    • @robmulla
      @robmulla  ปีที่แล้ว

      Yea, for small/medium files CSV gets the job done.

  • @gsm7490
    @gsm7490 9 หลายเดือนก่อน

    Parquet really saved me )
    Around one year of data, where each day is approx. 2GB (csv format). Parquet is both compact and fast.
    But I have to use filtering and load only the necessary columns “on demand”.

  • @arpanpatel9191
    @arpanpatel9191 2 ปีที่แล้ว +2

    Great video!! Small things matter the most. Thanks

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Absolutely! Thanks.

  • @chrisogonas
    @chrisogonas ปีที่แล้ว +1

    Great stuff! Thanks for sharing.

    • @robmulla
      @robmulla  ปีที่แล้ว +1

      Glad you enjoyed it!

    • @chrisogonas
      @chrisogonas ปีที่แล้ว

      @@robmulla 👍

  • @casey7411
    @casey7411 ปีที่แล้ว +1

    Very informative video! Subscribed :)

    • @robmulla
      @robmulla  ปีที่แล้ว +1

      Glad it helped! 🙏

  • @MatthiasBussonnier
    @MatthiasBussonnier 2 ปีที่แล้ว +3

    On the first pass, when you timeit the csv writing, you time both writing the csv and generating the dataset. So the results are likely biased, since for the other formats you only time the writing. (Sure, it does not change the final message, just wanted to point it out.)
    Also, with timeit you can use the -o flag to store the result in a variable, which can help you, for example, make a plot of the times.

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Good point about timing the dataframe generation. It should be negligible but fair to note. Also great tip on using -o. I didn't know about that! It looks like from the docs it writes the entire stdout, so it would need to be parsed. ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit Still a handy tip. Thanks!
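
    A sketch of both points in this thread, for a Jupyter/IPython cell (the frame below is a made-up stand-in; note that %timeit -o returns a TimeitResult object rather than raw stdout):

    import numpy as np
    import pandas as pd

    # Build the DataFrame once, outside the timed statement, so only the write is measured.
    df = pd.DataFrame({"a": np.random.rand(100_000), "b": np.random.rand(100_000)})

    # -o stores the timing result in a variable instead of only printing it.
    result = %timeit -o df.to_csv("test.csv", index=False)
    print(result.average, result.stdev)  # seconds; handy if you want to plot the times later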

  • @krishnapullak
    @krishnapullak 2 ปีที่แล้ว +1

    Good tips on speeding up large file read and write

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Glad you liked it! Thanks for the feedback.

  • @mr_easy
    @mr_easy 10 หลายเดือนก่อน +1

    great comparison. What about HDF5 format? Is it in anyway better?

  • @JohnMitchellCalif
    @JohnMitchellCalif ปีที่แล้ว +1

    super clear and useful! Subscribed

    • @robmulla
      @robmulla  ปีที่แล้ว

      Awesome, thank you!

  • @DAN_1992
    @DAN_1992 ปีที่แล้ว +2

    Thanks a lot, just brought down my database backup size to MBs.

    • @robmulla
      @robmulla  ปีที่แล้ว +1

      Glad it helped. That’s a huge improvement!

  • @MrWyYu
    @MrWyYu 2 ปีที่แล้ว +1

    Great summary of data types. Thanks

    • @robmulla
      @robmulla  2 ปีที่แล้ว +1

      Thanks for the feedback! Glad you found it helpful.

  • @danieleingredy6108
    @danieleingredy6108 ปีที่แล้ว +1

    This blew my mind, duuude

    • @robmulla
      @robmulla  ปีที่แล้ว

      Happy to hear that! Share with others so their minds can be blown too!

  • @baharehbehrooziasl9517
    @baharehbehrooziasl9517 ปีที่แล้ว +1

    Great! Thank you for this very helpful video.

    • @robmulla
      @robmulla  ปีที่แล้ว +1

      Glad it was helpful!

  • @MarcBenkert001
    @MarcBenkert001 2 ปีที่แล้ว +10

    Thanks, great comparison. One thing about Parquet: it has some limitations on which characters column names can contain. I spent quite some time renaming column names a year ago; perhaps that has been fixed by now.

    • @robmulla
      @robmulla  2 ปีที่แล้ว +3

      Good point! I've noticed this too. Definitely a limitation that makes it sometimes unusable. Thanks for watching!

  • @olucasharp
    @olucasharp ปีที่แล้ว +1

    Huge thanks for sharing 🍀

    • @robmulla
      @robmulla  ปีที่แล้ว

      Glad you liked it! Thanks for the comment.

  • @pawarasiriwardhane3260
    @pawarasiriwardhane3260 ปีที่แล้ว +1

    This content is really awesome

    • @robmulla
      @robmulla  ปีที่แล้ว

      Appreciate that!

  • @MAKSIMILIN-h8e
    @MAKSIMILIN-h8e 2 ปีที่แล้ว +1

    Nice video. I'm going to rewrite our storage to parquet.

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      You should! Parquet is awesome.

  • @danilzubarev2952
    @danilzubarev2952 ปีที่แล้ว

    Lol this video changed my life :D Thank you so much.

  • @Extremesarova
    @Extremesarova 2 ปีที่แล้ว +16

    Informative video! I've heard about feather and pickle, but never used them. I think I should give feather and parquet a try!
    I'd like to get some materials on machine learning and data science that are not introductory - something for middle and senior engineers :)

    • @robmulla
      @robmulla  2 ปีที่แล้ว +3

      Glad you found it useful. I’ll try to make some more ML videos in the near future.

  • @CalSticks
    @CalSticks 2 ปีที่แล้ว

    Really useful video - thanks.
    I was just searching for some Pandas videos for some light upskilling on the weekend, so this was a great find.

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Glad I could help! Check out my other videos on pandas too if you liked this one.

  • @melanp4698
    @melanp4698 ปีที่แล้ว +1

    12:28 "When your data set gets very large." - Me working with 800GB json files: :)
    Good video regardless, I might give them a test sometime.

    • @robmulla
      @robmulla  ปีที่แล้ว

      Haha. It’s all relative. When your data can’t fit in local ram you need to start using things like spark.

  • @coopernik
    @coopernik ปีที่แล้ว +1

    I’m working on a little project and I have a csv file that’s 15GB. If I get what you’re telling me, I could turn it into a parquet file and save tons of memory space and time?

  • @huuquannguyen6688
    @huuquannguyen6688 2 ปีที่แล้ว +1

    I really hope you make a video about Data Cleaning in Python soon. Thanks a lot for all your awesome tutorials

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      I'll try my best. Thanks for the feedback!

  • @pele512
    @pele512 2 ปีที่แล้ว +5

    Thanks for the great benchmark. In R / Python hybrid environment I sometimes use `csv.gz` or `tsv.gz` to address the size issue with CSV but retain the ability to quickly pipe these through line based processors. It would be interesting to see how gzipped flat files perform. I do agree that parquet/feather is a better way to go for many reasons, they are superior especially from the data engineering point of view.

    • @robmulla
      @robmulla  2 ปีที่แล้ว +2

      I do the same with gzipped CSV files. Good idea about making a comparison. I’ll add it to the list of potential future videos.
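
    Pandas handles the gzipped flat files mentioned above transparently; a minimal sketch with placeholder names:

    import pandas as pd

    df = pd.DataFrame({"a": range(1000), "b": range(1000)})

    # Compression is inferred from the file extension on both write and read.
    df.to_csv("data.csv.gz", index=False)
    roundtrip = pd.read_csv("data.csv.gz")
    # The result is still a plain gzipped text file, so it keeps working with zcat/grep-style pipelines.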

  • @hugoy1184
    @hugoy1184 ปีที่แล้ว +1

    Thank u very much for sharing such useful skills! 😉Subscribed!

    • @robmulla
      @robmulla  ปีที่แล้ว

      Anytime! Glad you liked it.

  • @riessm
    @riessm 2 ปีที่แล้ว +1

    In addition to everything, parquet is the native file format for Spark and fully supports Spark's lazy evaluation (Spark will only ever read the columns and rows that are needed for the desired output). If you ever prep really big data for Spark, parquet is the way to go.

    • @robmulla
      @robmulla  2 ปีที่แล้ว +1

      That’s a great point. Same with polars!

    • @riessm
      @riessm 2 ปีที่แล้ว

      @@robmulla Need to have a closer look at polars then! 🙂
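
    Since polars came up, a hedged sketch of its lazy scan, which reads only the columns and rows the query needs (file and column names are invented):

    import polars as pl

    lazy = pl.scan_parquet("events.parquet")  # nothing is read yet
    result = (
        lazy
        .select(["user_id", "amount"])        # column pruning
        .filter(pl.col("amount") > 0)         # predicate pushdown
        .collect()                            # the actual read happens here
    )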

  • @SergioBerlottoJr
    @SergioBerlottoJr 2 ปีที่แล้ว +1

    Awesome information! Thank you for this.

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Glad you liked it!

  • @rafaelnegreiros_analyst
    @rafaelnegreiros_analyst ปีที่แล้ว

    Amazing.
    Congrats on the video

    • @robmulla
      @robmulla  ปีที่แล้ว

      Glad you like the video. Thanks for watching.

  • @nirbhay_raghav
    @nirbhay_raghav 2 ปีที่แล้ว +1

    Another awesome video. It has become my favorite channel. My only regret is that I found it too late.
    Small correction: it should be 0.3s and 0.08s for the parquet files. You mistakenly wrote 0.3ms and 0.08ms while converting.
    Thanks.

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Appreciate that you are finding my videos helpful. Good catch on finding that typo!

    • @Jay-og6nj
      @Jay-og6nj ปีที่แล้ว

      I was going to comment that, but decided to check first; at least someone should have caught that. Good video.

  • @FranciscoPMatosJr
    @FranciscoPMatosJr 2 ปีที่แล้ว +2

    Try adding "brotli" compression when you create the file. The file size shrinks considerably and reads get a lot faster.
    Example:
    To save the file:
    from pyarrow import csv, parquet
    parse_options = csv.ParseOptions(delimiter=delimiter)
    data_arrow = csv.read_csv(temp_file, parse_options=parse_options, read_options=csv.ReadOptions(autogenerate_column_names=autogenerate_column_names, encoding=encoding))
    parquet.write_table(data_arrow, parquet_file + '.brotli', compression='BROTLI')
    To read the file: pd.read_parquet(file, engine='pyarrow')

    • @robmulla
      @robmulla  ปีที่แล้ว

      Oh. Very cool I need to check that out.

  • @haidar9798
    @haidar9798 วันที่ผ่านมา

    Parquet really helped me. I was super confused dealing with such big data; I tried every way to reduce the size but never got the best accuracy and fast response, but this one really changed it.

  • @steven7639
    @steven7639 ปีที่แล้ว +1

    Fantastic video

    • @robmulla
      @robmulla  ปีที่แล้ว

      Fantastic comment. 😎

  • @againstthegrain5914
    @againstthegrain5914 2 ปีที่แล้ว +2

    Hey this was very useful to me thank you for sharing!!

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      So glad you found it useful.

  • @vladvol855
    @vladvol855 ปีที่แล้ว

    Hello! Very interesting! Thank you! Can you please tell me, is there any limit on the number of columns when saving a DF to parquet? Excel allows around 16-17k columns! Thank you for the answer!

  • @safsaf2k
    @safsaf2k ปีที่แล้ว

    This is excellent, thank you man

    • @robmulla
      @robmulla  ปีที่แล้ว

      Glad it helped!

  • @vigneshwarselva9276
    @vigneshwarselva9276 ปีที่แล้ว +1

    Was very useful, thanks much

    • @robmulla
      @robmulla  ปีที่แล้ว

      Thanks! Glad you learned something new.

  • @bkcy18
    @bkcy18 4 หลายเดือนก่อน

    Amazing content!

  • @ibaha411
    @ibaha411 ปีที่แล้ว +1

    Is there a way you could make those dataframes editable where user could change the value?

    • @robmulla
      @robmulla  ปีที่แล้ว

      There are ways, but that's not really what they are intended for. They are more for bulk data analysis.

  • @Zoltag00
    @Zoltag00 2 ปีที่แล้ว +7

    Great video - It would have been good to at least mention the downsides to pickle and also the built in compatibility with zip files. Haven't come across feather before, will try it out

    • @robmulla
      @robmulla  2 ปีที่แล้ว +5

      Great point! I did forget to mention that pandas will auto-unzip. I still like parquet the best.

    • @Zoltag00
      @Zoltag00 2 ปีที่แล้ว +3

      @@robmulla - Agreed, parquet has some serious benefits
      You know it also supports a compression option? Use it with gzip to see your parquet file get even smaller (and you only need to use it on write)

  • @LissetteBF
    @LissetteBF ปีที่แล้ว +1

    Interesting video! Thanks. I tried to compress several csv files into one parquet, but I had several problems with ISO 8601 datetimes with time zones. I just couldn't change the format after all my efforts, so I had to keep using csv, since it had no problems converting with to_datetime. Any suggestions for compressing files without datetime problems? Thanks!

    • @robmulla
      @robmulla  ปีที่แล้ว

      Oh yes! I've had this same problem and it can be really annoying. Have you made sure you updated apache arrow to the latest version? stackoverflow.com/questions/58854466/best-way-to-save-pandas-dataframe-to-parquet-with-date-type

  • @i-Mik
    @i-Mik ปีที่แล้ว +1

    It's useful for me, thanks a lot!

    • @robmulla
      @robmulla  ปีที่แล้ว

      Happy to hear that!

  • @ruckydelmoro2500
    @ruckydelmoro2500 ปีที่แล้ว +1

    Can I do this for building text recognition? How do I save and read the data if it's an image?

    • @robmulla
      @robmulla  ปีที่แล้ว

      Not sure what you mean. That's possible but not really related to this. There are other ways of saving text and images that would be better.

  • @gregory8988
    @gregory8988 ปีที่แล้ว +1

    Rob, could you explain how to add contents with internal links to paragraphs of the jupyter notebook?

    • @robmulla
      @robmulla  ปีที่แล้ว

      I think you use something like this in the markdown: [section title](#section-title) check this link: stackoverflow.com/questions/28080066/how-to-reference-a-ipython-notebook-cell-in-markdown

    • @gregory8988
      @gregory8988 ปีที่แล้ว

      @@robmulla clear explanation, thank you!

  • @EVL624
    @EVL624 2 ปีที่แล้ว +1

    Very good and informative video

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      So nice of you. Thanks for the feedback.

  • @luketurner314
    @luketurner314 ปีที่แล้ว +1

    3:34 another example: if the data is going to be read in via JavaScript (like on a website) then wouldn't JSON be the best option?

    • @robmulla
      @robmulla  ปีที่แล้ว

      JSON has a lot of great use cases, and that is probably one of them. This video is more focused on large datasets to be processed in bulk. JSON is great for data that is less structured.

  • @mint9121
    @mint9121 5 หลายเดือนก่อน

    Great comparison, thanks

  • @ashishkharangate1110
    @ashishkharangate1110 5 หลายเดือนก่อน

    When I import csv files, the column names always cause problems during queries. How should I handle that? Do I change the column names?

  • @ChrisHalden007
    @ChrisHalden007 ปีที่แล้ว +1

    Great video. Thanks

    • @robmulla
      @robmulla  ปีที่แล้ว

      You are welcome!

  • @aaronsayeb6566
    @aaronsayeb6566 5 หลายเดือนก่อน +1

    the pickle format seems to be significantly faster (10x) than parquet in the final 5mil row test

  • @JoeMcMullin
    @JoeMcMullin 10 หลายเดือนก่อน

    Great video and content.

  • @TheRecordedLife
    @TheRecordedLife 2 ปีที่แล้ว +1

    How about the ORC file format? It is also widely used to store data.

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      I don't know much about the ORC file format but I'll look into it. I don't think pandas has a built-in function for saving to it. It looks like pandas does have a `read_orc` method though.
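
    For reference, a short sketch of the ORC round trip in pandas; this assumes a recent pandas (DataFrame.to_orc arrived in 1.5) with pyarrow installed:

    import pandas as pd

    df = pd.DataFrame({"a": range(1000), "b": range(1000)})

    df.to_orc("data.orc")             # written via pyarrow
    loaded = pd.read_orc("data.orc")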

  • @bprods
    @bprods 2 ปีที่แล้ว +1

    what shortcut are you using to switch block from code to markdown?

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      Y for code and M for markdown. You should check out my video on setting up jupyter where I go into detail about the shortcuts I often use.

  • @ExcelPowerPythonHub
    @ExcelPowerPythonHub 2 ปีที่แล้ว +1

    If we use df.to_parquet, can we see the data?

    • @robmulla
      @robmulla  2 ปีที่แล้ว

      You can’t open the data in something like excel. But you can open it with pandas.

  • @leonjbr
    @leonjbr ปีที่แล้ว +3

    Hi Rob! I love your channel. It is very helpful. I would like to ask you a question: is HDF5 any better than all the options you showed in the video?

    • @robmulla
      @robmulla  ปีที่แล้ว

      Good question. I didn't cover it because I thought it's an older, lesser used format.

    • @leonjbr
      @leonjbr ปีที่แล้ว +1

      @@robmulla so the answer is no?

    • @robmulla
      @robmulla  ปีที่แล้ว +1

      @@leonjbr The answer is - I don't know but probably not. 😁

    • @leonjbr
      @leonjbr ปีที่แล้ว

      @@robmulla ok thanks.

    • @CoolerQ
      @CoolerQ ปีที่แล้ว +1

      I don't know about "better" but HDF5 is a very popular data format in science.
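
    For anyone curious, a hedged sketch of the pandas HDF5 round trip discussed here; it relies on the optional PyTables dependency, and the names are placeholders:

    import pandas as pd

    df = pd.DataFrame({"a": range(1000), "b": range(1000)})

    # format="table" lets you query subsets later; the default "fixed" format is faster but must be read whole.
    df.to_hdf("data.h5", key="df", mode="w", format="table")
    loaded = pd.read_hdf("data.h5", "df")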

  • @Andrew-ud3xl
    @Andrew-ud3xl 4 หลายเดือนก่อน

    I didn't know about reading just select columns in polars. I wanted to see how much bigger converting a 320MB parquet file to csv and json would be: the csv was over 5 times larger and the json 17.5 times.

  • @yogiananta9674
    @yogiananta9674 ปีที่แล้ว

    awesome ! thank you for this tutorial

    • @robmulla
      @robmulla  ปีที่แล้ว

      You're very welcome! Share with a friend.

  • @duoasch
    @duoasch ปีที่แล้ว

    What was the purpose of generating a new dataset per test? Couldn't you run the save/load functions separately from the dataset generation?

  • @codeman99-dev
    @codeman99-dev ปีที่แล้ว +2

    Hey, just want to mention that when you wrote the pickle file to disk, you did so with no compression, while the other formats have compression by default.

    • @robmulla
      @robmulla  ปีที่แล้ว

      Good point. I guess it would be slower and smaller if compressed.

    • @franciscoborn2859
      @franciscoborn2859 ปีที่แล้ว

      you can compress parquet too with the compression parameter =D, and it's even smaller

  • @marcosissler
    @marcosissler 6 หลายเดือนก่อน

    What about saving and reading massive data from PostgreSQL? I had some trouble loading csv into the DB using psql. Then I changed to pgloader, which is much faster and error-free, but it's still super slow: around 30 min for 1.7M rows.

  • @baharehbehrooziasl9517
    @baharehbehrooziasl9517 ปีที่แล้ว

    When we create a parquet dataset, can we dummycode the columns?

  • @Patrick-hl1wp
    @Patrick-hl1wp ปีที่แล้ว

    super awesome tricks, thank you

    • @robmulla
      @robmulla  ปีที่แล้ว

      Glad you like them! Thanks for watching.

  • @abhisekrana903
    @abhisekrana903 10 หลายเดือนก่อน +1

    Stumbled onto this awesome video and absolutely loved it. Just out of curiosity, what tool are you using to theme your Jupyter notebook, especially the dark theme?

    • @robmulla
      @robmulla  10 หลายเดือนก่อน

      Glad you enjoyed the video. I have a different video that covers my jupyter setup including theme: th-cam.com/video/5pf0_bpNbkw/w-d-xo.html

  • @DevenMistry367
    @DevenMistry367 ปีที่แล้ว +2

    Hey Rob, this was a really nice video! Can you please make a tutorial where you try to write this data to a database? Maybe sqlite or postgres? And explain bottlenecks? (Optional: with or without using an ORM).

    • @robmulla
      @robmulla  ปีที่แล้ว +1

      I was actually working on just this type of video and even looking at stuff like duckdb where you can write SQL on parquet files.

  • @sangrampattnaik744
    @sangrampattnaik744 ปีที่แล้ว +1

    Very nice explanation. Can you compare Dask and PySpark ?

  • @getolvid5468
    @getolvid5468 ปีที่แล้ว

    Great comparison, thanks. Not sure if the feather/pickle files I'm creating from a Julia script use any compression (none that I'm specifying out of the box), but the pickle files always end up about half the size of the feather ones.
    (Haven't compared those two to a parquet file.)

  • @giancarloronchi9832
    @giancarloronchi9832 2 ปีที่แล้ว +1

    Hi all! I'm a newbie and had some difficulties using %time; it seems it's not a standard Python command, but only works in some specific Python environments. Did I do something wrong? Thanks!

    • @robmulla
      @robmulla  ปีที่แล้ว

      It only works in jupyter. Otherwise check out the timeit package.