Repartition vs Coalesce | Spark Interview questions

  • Published on 29 Jan 2025

Comments • 44

  • @AbhishekKumar-yq5uf 4 years ago +10

    That was brilliantly explained! Thanks a lot. You're a champ

    • @DataSavvy 4 years ago +1

      Thanks Abhishek :)

  • @pankajchikhalwale8769 10 months ago

    Great explanation.
    Excellent teaching.
    Please do a deeper dive into the coalesce and partition commands and several scenarios (e.g. coalesce after repartition, repartition after coalesce, coalesce after coalesce, etc.). I know that some of these may not be meaningful, but I have seen several of your videos and you are a GREAT teacher.

  • @kiranmudradi26 4 years ago +3

    Good video. Please cover more scenario-based questions in Spark and Streaming, like:
    1. How to handle failover in a Spark Structured Streaming application. Please also cover this through certain use cases, such as IoT systems.
    2. How do you build a streaming pipeline, and how will you combine streaming data with batch data? In particular, which tools would you choose to build such a pipeline? That would be more helpful.

    • @DataSavvy 4 years ago +1

      Sure Kiran... I will cover these as part of the streaming series.

    • @kiranmudradi26 4 years ago +1

      Thank you so much. Looking forward to the videos.

    • @DataSavvy 4 years ago

      Thanks Kiran :)

  • @sreehithanelluri9105 4 years ago +2

    Thanks for the content!! Your new videos are really good.

    • @DataSavvy 4 years ago

      Thanks... Your words are very encouraging :)

  • @praneethbhat4703 3 years ago

    You have explained the concept beautifully. I understand it clearly. Please execute the concept while you are explaining it. Can you explain when to choose which of these two?

  • @chaitanyag.8415 3 years ago

    You are doing a wonderful job. Thank you so much.

  • @ravindrareddyk7298 4 years ago

    It's a very good explanation 👍👍👍

  • @ravikumark6746 4 years ago +1

    Super, sir. The points are so clear. Thanks.

    • @DataSavvy 4 years ago

      Thanks Ravi :)

  • @anumsheraz 3 years ago

    Another amazing video. Thanks :)

  • @sudharsanbabu4456 4 years ago +3

    Excellent video!
    Sir, please clarify my doubts. For example, let's imagine I have 4 data nodes (machines). Now I am going to store 20 GB of data in the cluster, and I can also specify 4 partitions. In this situation, will the data be sliced and distributed across all 4 nodes, or will it be stored on any 2 nodes? Is there any relation to the number of cores in the system? How will the name node manage this? Please clarify, sir.

  • @roysumit3091 4 years ago +1

    Wonderfully explained

    • @DataSavvy 4 years ago +1

      Thanks Sumit :)

  • @RahulRawat-wu1vv 4 years ago +2

    Hi, I have a few questions:
    1. Are there any cases in which repartition does not lead to shuffling?
    2. How do I decide how many partitions to pass to the coalesce and repartition functions for optimal utilisation?

    • @DataSavvy 4 years ago +3

      1. Good question, I will check on that and get back.
      2. Your partition size should usually be 128 MB, so the total file size divided by 128 MB is your ideal number of partitions. Anything around this number is fine in practice. In some rare situations you may want to double this number by dividing the total size by 64 MB; that is usually done to increase parallelism.
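
The rule of thumb in the reply above (total size divided by a target partition size) can be sketched as a small helper. This is only an illustration: the function name `estimate_partitions` and the 10 GB input are assumptions made for the example, not part of Spark's API.

```python
import math


def estimate_partitions(total_size_mb: float, target_partition_mb: float = 128) -> int:
    """Rule-of-thumb partition count: total size / target partition size, rounded up.

    Hypothetical helper for illustration only; it is not a Spark function.
    """
    if total_size_mb <= 0:
        return 1
    return max(1, math.ceil(total_size_mb / target_partition_mb))


ten_gb_in_mb = 10 * 1024  # assumed example input: a 10 GB file

# 128 MB target partitions
print(estimate_partitions(ten_gb_in_mb))
# 64 MB target partitions give twice as many partitions (more parallelism)
print(estimate_partitions(ten_gb_in_mb, 64))
```

The result would then be passed as the argument to `repartition` or `coalesce`; as the reply says, any value in the neighbourhood of the estimate is usually fine.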

    • @RahulRawat-wu1vv 4 years ago +1

      @DataSavvy One doubt: if I have a scenario where I have a file which is quite huge, say 200 or 300 GB, will this formula be feasible, given that a lot of partitions will be created? Are that many partitions feasible?

    • @DataSavvy 4 years ago +2

      Yes, that many partitions are feasible. The number of partitions is also based on how many parallel tasks you want to run. In a lot of situations we have decided that our partitions should be 256 MB; then the number of partitions will halve. Moreover, for large datasets you will have partitions in the thousands. That's very normal.
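
As a worked example of the halving described above, assuming the 300 GB file from the question and taking 1 GB = 1024 MB:

```latex
\frac{300 \times 1024\ \text{MB}}{128\ \text{MB}} = 2400\ \text{partitions},
\qquad
\frac{300 \times 1024\ \text{MB}}{256\ \text{MB}} = 1200\ \text{partitions}
```

Both counts are in the thousands, which matches the reply's point that this is normal for large datasets.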

    • @RahulRawat-wu1vv 4 years ago

      @DataSavvy Thanks, that was quite informative.

  • @crimemastergogo5283 3 years ago

    Looks like coalesce should be the one preferred over repartition whenever there is a need to reduce the number of partitions, because it looks more efficient.
    Can there be a scenario where you need to reduce the number of partitions and still prefer repartition over coalesce?

  • @sumittiwari2713 3 years ago

    Thank you for sharing

  • @vikkyjambhulkar534 4 years ago

    What is the best case for choosing between repartition and coalesce?
    Dataset 1:
    id, name
    Dataset 2:
    id, name
    I want to join these two datasets. Which would you use, repartition or coalesce?

  • @arunasingh8617 2 years ago

    Well Explained!

    • @DataSavvy 2 years ago

      Thank you Aruna

  • @prakashmudliyar4834 3 years ago

    How do we know that my current partitions are unevenly balanced across my executors, so that I have to use repartition or coalesce?

  • @vaibhavmore1059 4 years ago

    Nicely explained!!
    I have a few questions about joins in Spark and about repartition and coalesce.
    1. Suppose we have 2 dataframes: dataframe 'A' has 20 partitions and dataframe 'B' has 10 partitions. After joining A and B, what will the output of the joined dataframe be?
    2. Both repartition and coalesce can be used to decrease the number of partitions. Can you please explain this with a use case?

    • @DataSavvy 4 years ago +2

      Thanks Vaibhav. 1. Spark will do an inner join by default and give the result. 2. I will create a video explaining an example.
      Please join our Telegram group. We discuss a lot of stuff there.

  • @SagarSingh-ie8tx 2 years ago

    Very nice 🎉

  • @sanjeev5149 4 years ago +1

    One doubt: if repartition creates new partitions, then what happens to the old partitions? Will they be removed, or will they continue to use memory?

    • @prosperakwo7563 4 years ago +1

      The new partitions are created after shuffling the existing data, hence the old partitions cease to exist.

  • @raghavapinninti7278 3 years ago

    Great video.
    How could we solve the problem where one executor has a bigger partition and another executor has a smaller partition to process? In Spark, can we arrange for an even distribution among executors?

  • @swagatdas9963 4 years ago

    What will happen with repartition(1) vs coalesce(1)?
    Thanks for the content.

    • @madhu1987ful 4 years ago

      Same question.
      coalesce(1) will cause shuffling, right?

    • @adityavandanapu2791 3 years ago

      @madhu1987ful No shuffling in coalesce.

  • @srinivasasameer9615 4 years ago +1

    What about coalesce with shuffle set to true versus repartition?

    • @DataSavvy 4 years ago +3

      If you set the shuffle flag to true and use coalesce, then there will be a shuffle, but that defeats the purpose of coalesce. You should use repartition if you are OK with a shuffle.
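
The relationship described in the reply above can be sketched with a toy model in plain Python. This is not Spark's implementation: partitions are modelled as lists of records, the merge and round-robin rules are simplifying assumptions (Spark's real grouping also considers data locality), and the function names only mirror Spark's RDD API, where `repartition(n)` is defined as `coalesce(n, shuffle = true)`.

```python
from typing import List

Partition = List[int]


def coalesce(parts: List[Partition], n: int, shuffle: bool = False) -> List[Partition]:
    """Toy model of coalesce: build n partitions out of the existing ones."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if shuffle:
        # Shuffle: every record may move, so records are spread evenly.
        out: List[Partition] = [[] for _ in range(n)]
        i = 0
        for part in parts:
            for record in part:
                out[i % n].append(record)
                i += 1
        return out
    # No shuffle: whole neighbouring partitions are merged as-is, so the
    # count can only go down and existing skew is carried over.
    n = min(n, len(parts))
    out = [[] for _ in range(n)]
    for idx, part in enumerate(parts):
        out[idx * n // len(parts)].extend(part)
    return out


def repartition(parts: List[Partition], n: int) -> List[Partition]:
    """Toy model of repartition: coalesce with the shuffle flag set."""
    return coalesce(parts, n, shuffle=True)


# Four skewed partitions holding 6, 1, 1 and 4 records.
parts = [list(range(6)), [6], [7], list(range(8, 12))]

print([len(p) for p in coalesce(parts, 2)])      # merged without moving records between groups
print([len(p) for p in repartition(parts, 2)])   # redistributed evenly
print([len(p) for p in coalesce(parts, 1)])      # everything ends up in one partition
```

In this model, coalescing without a shuffle keeps whatever skew the input had, while the shuffled variant evens the sizes out at the cost of moving every record.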

    • @srinivasasameer9615 4 years ago +1

      @DataSavvy Thanks for your response.

  • @Praveen_Kumar_R_CBE 4 years ago +1

    The group is full, sir.

    • @DataSavvy 4 years ago

      Yes Praveen, please join Telegram. We have moved to Telegram.

  • @harmeetsingh6619 3 years ago

    Hi, I was given a question. Assume we have 2 big SQL fact tables, each with 50 GB of data, and 1 small dimension table. There are 20 nodes, each quad-core with 32 GB.
    1. What configuration would be chosen in terms of executor memory, number of executors, and cores?
    2. How would you create an ETL pipeline for a full load?
    3. What changes would you make for an incremental load?
    4. Would you do any source-side joins, since the dimension table is small? Or how would you handle the dimension table, assuming it needs to be joined to the big fact table?
    Can you please share your thoughts on this?