what is Apache Parquet file | Lec-7

  • Published on 14 Oct 2024
  • In this video I talk about reading Parquet files in Spark. If you want to optimize your files and processing in Spark, you need a solid understanding of the Parquet file format. Please post your doubts in the comments section.
    Directly connect with me on:- topmate.io/man...
    Download Parquet Data:- github.com/dat...
    Install parquet-tools locally to run the commands below.
    parquet-tools can be installed with pip.
    Run the command below in cmd or a terminal:
    pip install parquet-tools
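    Run the commands below inside Python:
    import pyarrow as pa
    import pyarrow.parquet as pq
    # Raw string, so backslashes in the Windows path are not treated as escapes.
    parquet_file = pq.ParquetFile(r'C:\Users\nikita\Downloads\Spark-The-Definitive-Guide-master\data\flight-data\parquet\2010-summary.parquet\part-r-00000-1a9822ba-b8fb-4d8e-844a-ea30d0801b9e.gz.parquet')
    # print() shows the output even when this runs as a script rather than in the interactive shell.
    print(parquet_file.metadata)                                    # file-level metadata
    print(parquet_file.metadata.row_group(0))                       # first row group
    print(parquet_file.metadata.row_group(0).column(0))             # first column chunk of that row group
    print(parquet_file.metadata.row_group(0).column(0).statistics)  # min, max and null count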
    Run the below command in cmd/terminal
    parquet-tools show C:\Users\manish\Downloads\Spark-The-Definitive-Guide-master\data\flight-data\parquet\2010-summary.parquet\part-r-00000-1a9822ba-b8fb-4d8e-844a-ea30d0801b9e.gz.parquet
    parquet-tools inspect (path of your file location as above)
    parquet.apache...
    For more queries, reach out to me on the social media handles below.
    Follow me on LinkedIn:- / manish-kumar-373b86176
    Follow Me On Instagram:- / competitive_gyan1
    Follow me on Facebook:- / manish12340
    My Second Channel -- / @competitivegyan1
    Interview series Playlist:- • Interview Questions an...

Comments • 118

  • @manish_kumar_1
    @manish_kumar_1  8 months ago +33

    I said 500 GB in the video by mistake. It should be 500 MB; 500 / 128 is about 3.9, which rounds up to 4 partitions.

  • @kamalprajapati9955
    @kamalprajapati9955 8 days ago

    This tutorial is too good. What a detailed demo based insight. I could never forget this anymore. Thank you for your efforts. These tutorials are literal Gold.

  • @Shubhamkumar-cq5wt
    @Shubhamkumar-cq5wt a year ago +5

    Literally the best and most detailed video on parquet file format on yt. Thank you!

  • @krishnasahoo6172
    @krishnasahoo6172 a year ago +1

    Wow, such clarity, I really enjoyed it. I didn't even notice when the video ended! Excellent explanation.

  • @roshniagrawal4777
    @roshniagrawal4777 2 months ago

    Such a detailed and amazing video. I have been working in big data for many years, but I never knew this level of detail. Thank you so much for this detailed video; your way of teaching encourages and excites many learners. Hats off.

  • @akashprabhakar6353
    @akashprabhakar6353 5 months ago

    Predicate pushdown - Rows filtering, Projection Pruning/Pushdown - Column filtering. Thanks for the session bro!!

  • @bidyasagarpradhan2751
    @bidyasagarpradhan2751 9 months ago

    Someone asked me in an interview about the internals of the Parquet file format and I couldn't answer. Then I found your video. Now I can explain it easily. Best video on the Parquet file format.

  • @ApoorvaShinde-on4ep
    @ApoorvaShinde-on4ep 4 months ago

    This is by far the best video; I gained in-depth knowledge of Parquet from it, and it is very easy to understand.
    Thank you so much for sharing your knowledge!
    Could you please share the video on optimizing Parquet?

  • @alokkumarmohanty8454
    @alokkumarmohanty8454 a year ago +2

    Hi Manish,
    the Parquet file format class was a classic example of how to present something. If the Avro and ORC file formats could be covered in the same way, it would be really helpful. Interviewers are asking about those nowadays as well.

  • @divyanshusingh3966
    @divyanshusingh3966 9 days ago

    Thank you bro you are helping a lot of DEs good work 🎉

  • @RiyaBiswas-r1p
    @RiyaBiswas-r1p 5 months ago

    Never saw such a detailed video for parquet file, these videos are really valuable. Really appreciate the efforts put in making these videos

  • @sahillohiya7658
    @sahillohiya7658 11 months ago +1

    I love how in-depth you are going, please keep doing it! We are loving it.

  • @shubhamwaingade4144
    @shubhamwaingade4144 8 months ago +1

    The best explanation!!! Your videos are giving me motivation and inspiration to keep learning spark!

  • @ProgramwithVishal
    @ProgramwithVishal 2 months ago

    You are like gold mine in terms of knowledge

  • @dishant_22
    @dishant_22 a year ago

    This is the best explanation for parquet file format available online. Thanks Manish.

  • @lakkilakki772
    @lakkilakki772 a year ago +1

    Hi Manish, great explanation of Parquet. I'm using Parquet but didn't know about these features that make things fast. How were you able to learn all this? Please suggest any documentation or resources to get a deep understanding like this. You made my day. Thank you 😊

  • @ArunNair-z3m
    @ArunNair-z3m 3 months ago

    Hi Manish, thanks for such smooth explanation of not just information related to parquet but also things related to it, kudos to your efforts :D

  • @PrabhakarKumarJha-g8t
    @PrabhakarKumarJha-g8t a month ago

    You are really awesome. Thank you. You are adding an infinite value.

  • @ankitachauhan6084
    @ankitachauhan6084 4 months ago

    the best explanation ! you are a wonderful teacher

  • @amitpatel9670
    @amitpatel9670 19 days ago

    Hey Manish, I've been following your channel for a very long time, and thanks for the awesome videos. At the end of this video you said you would discuss how to optimize the Parquet file format, but I don't see that video in this playlist. Am I missing something?

    • @coolashishful
      @coolashishful 14 days ago

      this video only, last 10 mins.

  • @sunils9143
    @sunils9143 2 months ago

    Highly informative video, super!

  • @rahulgupta-po4ki
    @rahulgupta-po4ki a year ago

    highly informative and detailed video on parquet. Thanks a lot Manish!

  • @AyushMandloi
    @AyushMandloi 3 months ago

    Also please explain Bucketing and partitioning

  • @chiranjivmansis1415
    @chiranjivmansis1415 2 months ago

    Awesome explanation

  • @NirajAgrawal-e6v
    @NirajAgrawal-e6v a year ago

    Please make a video on avro file format in detail because I faced challenges when interviewers asked about avro file format questions

  • @dollykushwah6352
    @dollykushwah6352 a year ago

    Hello Manish, excellent explanation, hats off to you. When will you release the optimization video on Parquet? Eagerly waiting for it.

  • @krunalsuthar1420
    @krunalsuthar1420 3 months ago

    Please make video on ORC and Avro as well

  • @susanthomas223
    @susanthomas223 4 months ago

    Thank you so much for putting in so much time for making this video

  • @asifquasmi4538
    @asifquasmi4538 7 months ago

    Hats off, Manish. Please keep doing the good work :)

  • @AnubhavTayal
    @AnubhavTayal a year ago +1

    Hi Manish, thank you for the information. Could you please elaborate on the default value of 128 MB, and how 500 GB of data converts to 4 row groups? Thank you

    • @manish_kumar_1
      @manish_kumar_1  a year ago +3

      500 MB, not GB. 500 divided by 128 is about 3.9, i.e. 4.
      4 blocks of data will be created. HDFS has a default block size of 128 MB, and several cloud service providers use the same block size. So if you have, say, 140 MB of data, one partition will hold 128 MB and the next will hold just 12 MB.

    • @AnubhavTayal
      @AnubhavTayal a year ago

      @@manish_kumar_1 thank you so much!
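
The block arithmetic in this thread can be sketched in a few lines of Python. This is only an illustration, assuming a fixed 128 MB block size as stated in the reply above; `split_into_blocks` is a hypothetical helper, not a Spark or HDFS function.

```python
import math

BLOCK_SIZE_MB = 128  # default block size quoted in the reply (assumed fixed here)


def split_into_blocks(total_mb, block_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block a file of total_mb would be split into."""
    full_blocks, remainder = divmod(total_mb, block_mb)
    sizes = [block_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left over
    return sizes


print(split_into_blocks(500))  # [128, 128, 128, 116]
print(split_into_blocks(140))  # [128, 12]
print(math.ceil(500 / 128))    # 4
```

So 500 MB gives three full blocks plus a partial one, which is why 500 / 128 is rounded up to 4 rather than down to 3.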

  • @nikhiljain8411
    @nikhiljain8411 4 months ago

    How will it know which set of 1 lakh (100,000) records to fetch the data from? Don't we still need to scan the complete file?
    Kindly explain
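
One way to picture the answer is with the min/max statistics that the description's `.statistics` command prints for each row group. The sketch below is a toy model in plain Python, not how Spark or Parquet is implemented: `RowGroup` and `filter_age_greater_than` are hypothetical names, and the ages are made-up values chosen to mirror an "age > 18" filter.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """Toy stand-in for a row group: one column of values plus its min/max statistics."""
    ages: list
    min_age: int = field(init=False)
    max_age: int = field(init=False)

    def __post_init__(self):
        # Computed once "at write time", like the statistics kept in the file metadata.
        self.min_age = min(self.ages)
        self.max_age = max(self.ages)


def filter_age_greater_than(row_groups, threshold):
    """Return the matching values and how many row groups actually had to be read."""
    matches, groups_read = [], 0
    for rg in row_groups:
        if rg.max_age <= threshold:
            continue  # the statistics prove no row here can match, so skip the group
        groups_read += 1
        matches.extend(age for age in rg.ages if age > threshold)
    return matches, groups_read


row_groups = [RowGroup([3, 7, 12]), RowGroup([15, 17, 18]), RowGroup([21, 40, 65])]
matches, groups_read = filter_age_greater_than(row_groups, 18)
print(matches, groups_read)  # [21, 40, 65] 1
```

Only the small metadata is consulted for every row group; the data of the first two groups is never read because their maximum age is not above 18.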

  • @natarajbeelagi569
    @natarajbeelagi569 a month ago

    Wow super info

  • @tnmyk_
    @tnmyk_ 8 months ago

    Where is the nested JSON video? In the previous lecture, "how to read json file in PySpark", you said you would make a separate video on it.

  • @pankajsolunke3714
    @pankajsolunke3714 a year ago +1

    Hi manish sir,Thanks for bringing such valuable info ..I have a question like how can we handle schema evaluation in parquet

    • @manish_kumar_1
      @manish_kumar_1  a year ago

      I didn't get you

    • @mohitdaxini3067
      @mohitdaxini3067 a year ago

      I think he wants to know about schema evolution

    • @sankuM
      @sankuM a year ago

      @@manish_kumar_1 I think @pankajsolunke3714 is asking how to handle schema evolution in parquet if we can?

  • @avisinha2844
    @avisinha2844 a year ago +1

    Hello Manish, i really like your videos, thanks for the efforts you put in. Have a question, can you please tell a good tutorial/course that we can go through to get really good at pyspark, if not a single resource then what are the various resources that we can go through to get good at pyspark coding.

    • @manish_kumar_1
      @manish_kumar_1  a year ago +3

      You don't need a course. If you still want one, you can buy the Udemy course by Prashant Pandey titled pyspark for Beginner. The rest depends on how many questions you want to solve. Solve more problems rather than chasing multiple courses. Practice is the key to success, not the number of courses you have done.

    • @sankuM
      @sankuM a year ago

      sparkbyexamples is the RESOURCE we need for practice!! :)

    • @royalkumar7658
      @royalkumar7658 a year ago

      Where can we practice spark from?

    • @royalkumar7658
      @royalkumar7658 a year ago

      ​@@manish_kumar_1 where can we practice spark from??

    • @manish_kumar_1
      @manish_kumar_1  a year ago

      @@royalkumar7658 From LeetCode. Follow the playlist from the start and you will find out where and how.

  • @Wandering_words_of_INFJ
    @Wandering_words_of_INFJ 11 months ago

    Manish, if we write Parquet files with the data already sorted in ascending or descending order, then retrieving data would be faster, right? Because each row group's metadata would then hold min and max values covering a certain range? Please correct me if I am wrong.
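
The intuition in this question can be checked with a small simulation. The sketch below is plain Python with made-up values, not a measurement of Spark; `row_group_stats` and `groups_to_read` are hypothetical helpers, and it assumes a reader may skip any row group whose min/max range does not overlap the filter.

```python
def row_group_stats(values, group_size):
    """Chunk values into row groups and return a (min, max) pair per group."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [(min(g), max(g)) for g in groups]


def groups_to_read(stats, lo, hi):
    """Count row groups whose [min, max] range overlaps the filter lo <= x <= hi."""
    return sum(1 for mn, mx in stats if mx >= lo and mn <= hi)


values = [42, 7, 93, 15, 68, 3, 81, 27, 56, 99, 11, 74]

unsorted_stats = row_group_stats(values, 4)
sorted_stats = row_group_stats(sorted(values), 4)

print(unsorted_stats)  # [(7, 93), (3, 81), (11, 99)]
print(sorted_stats)    # [(3, 15), (27, 68), (74, 99)]
print(groups_to_read(unsorted_stats, 50, 60))  # 3
print(groups_to_read(sorted_stats, 50, 60))    # 1
```

With unsorted data every row group spans almost the whole value range, so none can be skipped; after sorting, the ranges no longer overlap and only one group has to be read for the same filter.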

  • @gitanjalimadaan537
    @gitanjalimadaan537 18 days ago

    Very good video!

  • @shubhamwaingade4144
    @shubhamwaingade4144 8 months ago

    One doubt, I did not understand the logical partitioning completely, it resembles with file size we can set in spark config. Please help me understand it

  • @harshtalwar9615
    @harshtalwar9615 2 months ago

    Superb bro … very helpful thanks 🙏🏻

  • @navjotsingh-hl1jg
    @navjotsingh-hl1jg a year ago

    Bro, why were 4 row groups used for 500 GB of data when the size is 128 MB? Could you explain why that is?

  • @MCAMadeEasy
    @MCAMadeEasy 6 months ago

    Manish bro, what about nested JSON?

  • @Marcopronto
    @Marcopronto a year ago

    Hi Manish,
    In the last video, you told that u will explain about nested json in further videos. Where can i find that?

    • @manish_kumar_1
      @manish_kumar_1  a year ago

      I have not done yet. I will try to make one soon.

    • @220piyush
      @220piyush 5 months ago

      Please make that one, brother. The industry runs on it.

  • @nileshgodase1007
    @nileshgodase1007 9 months ago

    Please explain nested JSON to DataFrame.

  • @patilsahab4278
    @patilsahab4278 9 months ago

    Hi bro, does each row group store 128 MB or 128 GB of data? You said 128 MB,
    but for 500 GB you said 4 row groups.
    Are you talking about 500 MB or 500 GB?

  • @sumitchoubey1284
    @sumitchoubey1284 4 months ago

    Unable to install parquet-tools. Can you help or point me in the right direction?

  • @shivakrishna1743
    @shivakrishna1743 a year ago

    Very detailed awesome video!! Thanks

  • @neelshah8247
    @neelshah8247 7 months ago

    Excellent video. Thank you :)

  • @dineshboliwar9545
    @dineshboliwar9545 a year ago

    Sir, please make a short video on downloading and installing the parquet tool.

  • @shadabalam17
    @shadabalam17 a month ago

    Do i need to install python first before downloading parquet tool?

  • @prathamesh_a_k
    @prathamesh_a_k 10 months ago

    Nice explanation, brother.

    • @prathamesh_a_k
      @prathamesh_a_k 10 months ago

      can you make one video on ORC also

  • @rpraveenkumar007
    @rpraveenkumar007 a year ago

    Hi Manish, what is projection pruning? Unable to find it on Google. Or is it Partition Pruning*? Can you please explain/clarify?

    • @manish_kumar_1
      @manish_kumar_1  a year ago

      It is projection pushdown, in which columns are pruned. So projection pushdown and projection pruning are the same thing.

    • @rpraveenkumar007
      @rpraveenkumar007 a year ago

      @@manish_kumar_1 thanks for clarifying!
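
The column pruning described in this thread can be illustrated with a toy columnar table. This is a plain-Python sketch, not Spark's implementation: `read_columns` is a hypothetical helper, the column names and values are invented, and the only assumption is that each column is stored separately so a reader can fetch just the columns a query selects.

```python
def read_columns(columnar_table, wanted):
    """Return only the requested columns and the number of values that were read."""
    selected = {name: columnar_table[name] for name in wanted}
    values_read = sum(len(column) for column in selected.values())
    return selected, values_read


# Each column is stored on its own, which is what makes skipping whole columns possible.
table = {
    "name": ["a", "b", "c", "d"],
    "age": [17, 25, 31, 12],
    "city": ["w", "x", "y", "z"],
}

selected, values_read = read_columns(table, ["age"])
total_values = sum(len(column) for column in table.values())

print(selected)                         # {'age': [17, 25, 31, 12]}
print(values_read, "of", total_values)  # 4 of 12
```

A query that selects only `age` touches a third of the stored values here; in a row-oriented layout every full row would have to be read to get the same column.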

  • @deeksha6514
    @deeksha6514 7 months ago

    Thanks! for this masterpiece

  • @vaibhavshanbhag5016
    @vaibhavshanbhag5016 2 months ago

    @manish_kumar_1 Sir, what great content you have made, I really enjoyed it, thank you!

  • @prashantmane2446
    @prashantmane2446 2 months ago

    "Error occurred while processing file"??
    This error keeps occurring... help:

  • @vaibhavmore7936
    @vaibhavmore7936 a year ago

    Thanks for this Manish! Great Work!

  • @gouravchourasia9515
    @gouravchourasia9515 25 days ago

    You should have spent more time and detail on projection pruning and pushdown

  • @lucky_raiser
    @lucky_raiser a year ago

    Bro, I really enjoyed it, thanks bro

  • @vipulbornare34
    @vipulbornare34 2 months ago

    Thankyou 😊

  • @SanjayShukla-qh8xj
    @SanjayShukla-qh8xj a month ago

    Bro nested json video is not available. Please provide in depth if feasible.

    • @manish_kumar_1
      @manish_kumar_1  29 days ago

      It is there. It should be in the practical playlist. Please check once.

  • @akashchandapureakashchanda1842
    @akashchandapureakashchanda1842 11 days ago

    Bro, did you install Java to read the parquet file in the command prompt?

    • @manish_kumar_1
      @manish_kumar_1  10 days ago

      Not specifically. It was already installed on my laptop.

  • @ashutoshkumarsingh3337
    @ashutoshkumarsingh3337 a year ago

    what a gem you are

  • @pankajjagdale2005
    @pankajjagdale2005 a year ago

    informative Thanks

  • @royalkumar7658
    @royalkumar7658 a year ago

    How is null written to disk??

  • @debopower2009
    @debopower2009 a year ago

    Very nice.

  • @AkshayPawar-c8j
    @AkshayPawar-c8j a year ago

    Thanks Manish 🙂

  • @ShekharBhide
    @ShekharBhide a year ago

    Sir, the parquet file is not downloading from GitHub.

  • @aravind5310
    @aravind5310 a year ago

    Your content is good. Why don't you make videos in English?

    • @manish_kumar_1
      @manish_kumar_1  a year ago

      I don't know English 😒. Just joking, I may record a session in the future, but not for now.

    • @izahmad90
      @izahmad90 a year ago

      th-cam.com/video/zM2OAAvJItQ/w-d-xo.html&ab_channel=knowledgeEpicenter (We are making videos for those people for whom no one is making videos.)

  • @radheshyama448
    @radheshyama448 a year ago

    😇

  • @pramod3469
    @pramod3469 a year ago

    Thanks Manish

  • @ajaypatil1881
    @ajaypatil1881 11 months ago

    Example of Modi ji for finding age >18 was highlight of the video

  • @sankuM
    @sankuM a year ago

    Hey @manish_kumar_1, I was able to use the modes (append, overwrite, etc.) using this command:
    df.write.option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .mode("Overwrite") \
    .csv(file_location)
    All other ways of writing return an error on Databricks if the file exists, even if we're trying to append the data! :| Unsure why this is happening! :\

    • @manish_kumar_1
      @manish_kumar_1  a year ago +1

      Same here. Maybe it is due to the Community Edition. In a production environment it does work.

    • @sankuM
      @sankuM a year ago

      @@manish_kumar_1 oh..okay! Still weird, though!!! I'm yet to try databricks in production..

  • @ranvijaymehta
    @ranvijaymehta a year ago

    Thanks sir

  • @yogesh9992008
    @yogesh9992008 a year ago

    Cmd-parquet-tool issue

  • @khadarvalli3805
    @khadarvalli3805 a year ago

  • @manish_kumar_1
    @manish_kumar_1  a year ago

    Directly connect with me on:- topmate.io/manish_kumar25

  • @ajaywade9418
    @ajaywade9418 10 months ago

    21:25 500 GB or 500 MB?

    • @220piyush
      @220piyush 5 months ago

      500 MB

  • @dineshboliwar9545
    @dineshboliwar9545 a year ago

    Can anybody help me, please? I can't read the parquet file using the command prompt.

    • @manish_kumar_1
      @manish_kumar_1  a year ago

      No problem. You can read it directly in Databricks. Just watch the video carefully once.

    • @dineshboliwar9545
      @dineshboliwar9545 a year ago

      @manish_kumar_1 I have done it in Databricks; it is the command prompt one that is not working.

  • @yogesh9992008
    @yogesh9992008 a year ago

    It shows a stage failure error.

  • @starky4910
    @starky4910 a month ago

    OK sir, I won't ask you for the notebook
    🤥😔

  • @KavyaPristha
    @KavyaPristha 2 months ago

    Please share your Twitter or X account. I will promote you. You are the only person on YouTube who is actually teaching something useful in the DE field. That TOO IN HINDI. Great work and great effort.
    God Bless You !!

    • @manish_kumar_1
      @manish_kumar_1  2 months ago

      I don't have ex😂😂. Sorry I mean this X

    • @KavyaPristha
      @KavyaPristha 2 months ago

      @@manish_kumar_1 Hahaha. Please create one then. It pays better than YouTube.

    • @manish_kumar_1
      @manish_kumar_1  2 months ago

      @@KavyaPristha oh is it. I did not know about this.

  • @DevSharma_31
    @DevSharma_31 a year ago

    import pyarrow as pa
    import pyarrow.parquet as pq
    parquet_file = pq.ParquetFile(r'C:\Users\DELL\Desktop\part-r-00000-1a9822ba-b8fb-4d8e-844a-ea30d0801b9e.gz.parquet')
    parquet_file.metadata
    parquet_file.metadata.row_group(0)
    parquet_file.metadata.row_group(0).column(0)
    parquet_file.metadata.row_group(0).column(0).statistics
    Not able to see any output with this file. Not sure why.

  • @afjalahamad2465
    @afjalahamad2465 7 months ago

    really awesome explanation

  • @wellwisher7333
    @wellwisher7333 a year ago

    Thanks Sir