hello sir, i have a small doubt, when we are supplying 3 separate files in a single df at 14:03 , then why the number of partitions is 3 , when the default partition size is 128 mb given the fact that the size of the df containing the 3 files is a lot less than 128 mb.
I see executors 1 and 2 in the picture before coalesce or repartition but post the action, I see both of them as executor 1. Is this pictorially wrong or does this operation reduces the num of executors as well.
QQ- The changes for default partition size will be at the cluster level or it will be only implemented for the notebook only. In case other jobs are running on cluster than will those also be impacted by the change in settings.
Hello Raja Sir, few days before i gave interview in that they asked a question like if we want to create 1 partition from multiple partition then which method you will choose coalesce or repartition ? i answered coalesce but they said we will use repartition. is it correct ?
Hi Suresh, In this case, number of partitions needs to be reduced. Coalesce and repartition both can be used to reduce number of partitions but choosing one of them is highly depending on the use case. So you should have asked more input from the interviewer to understand the use case better. If so many transformations would be applied after resizing the partition, repartition would be better choice. Otherwise coalesce is better choice
Raja Sir, Very well explained with example. I would like to know in the pictures you have given 2 executers for repartition and coalesce, but in the same picture you have shown output you named it as executer1 for both. is it by mistake or didn't i understood properly. Could you please clarify. this is difference is there in both the slides. -- Thank you Vydu
18:43 - When you decreased the number of partitions from 20 to 2 using repartition which is supposed to create evenly distributed data then why 1 partition contained 8 records and the other only 2, this is uneven data distribution?
one doubt...! when we have use repartition(2), then we got unevenly distributed partitions.'ie 8 in 1st partition and 2 in the other. but repartition should give us evenly distributed partition right? Please help me understand.
Hi Vamsi, good question. Data is getting evenly distributed in repartition. Here we can see some differences because of small data set. From spark point of view, 2 rows or 8 rows are almost same. We can see the difference between repartition and coalesce while dealing with huge amount of data like billion or millions of rows
Good evening recently I came across with a question in capgemini client interview. Consider a scenario 2 gb of file is distributed in hadoop. After doing some transformations we got 10 dataframe. By applying the repartition(1) all the data is sits in one dataframe the dataframe size is 1.8 gb but your data node size is 1gb only. Does this 1.8 gb will sit in the data node or not. If yes how? Uf no what error it willbe given Requesting you sir please tell me the answer for this question
It was good session is there any indication keyword to set for increase or decrease partition in repartition?. Because repartition (20) how will we know its increased or decreased?. After execution only will come to know its increased/decreased.
Hi sir, I was great explanation and good to see the practical implementation of it. But the only question is theoritically it was said that repartition will evenly distribute the data and coalesce will unevenly distribute the data. we it was practically implemented, I saw opposite results coalesce is taking evenly distrubuted values in two partitions but repartition doesn't. Can you please check ?
Good afternoon sir Requesting you to answer this question sir which I recently faced in interview sir please Consider you have read 1GB file into a dataframe. The max partition bytes configuration is set to 128MB. you have applied the repartition(4) or coalesce (4) on the dataframe any of the methods will decrease the number of partitions.If you apply the repartition(4) or coalesce (4) Partition size gets increase >128MB . but the max Partition bytes is configured to 128MB. Does it throws any error (or) not throws any error? If it throws an error what is the error we will get when we execute the program? If not what is the behaviour of spark in this scenario? Could you tell me the answer for this question sir. Recently I faced this question. Requesting you sir please
The configuration 'maxPartitionBytes' is playing the role while ingesting data from external system into spark memory. Once data is loaded into spark memory, the partition size can vary according to various transformation and has nothing to do with maxPartitionBytes. So in this case, it wont through any error. Coalesce would produce unevenly distributed partitions, where repartition would create evenly distributed partitions in this case. Hope it clarifies your doubts. Thanks for sharing your interview experience. others can be benefitted in this community
@@rajasdataengineering7585 thank you very much sir. I understand. If possible please do a video on this question sir. So we get more understanding visually sir. If possible please do it sir 🙏
sc.defaultParalellism is for RDD's and wil only work with RDD ? spark.sql.shuffle.partitions was introduced with DataFrame and it only works with DataFrame ?
I have never seen any video explained in such detail. Really appreciate you. It is very easy to understand.
Thank you
Follow this playlist, it is tremendous, sir, and you present the concepts in a very good way. Thank you sir.
Thank you Akshay
Wow, wonderful delivery, sir! Wonderful content.
Thanks Shailesh!
Nice explanation! In the mentioned example I can see that repartition(2) created partitions of unequal size (one with 8 records and another with 2 records), but I expected them to be of almost equal size.
@Sathish I also had the same doubt when watching the video. repartition(2) created partitions of unequal size but coalesce(2) had partitions with 5 records each. Got me confused.
@Raja sir, please clarify on the same.
@@riyazalimohammad633 Your understanding is right. Repartition always creates evenly distributed partitions (as I explained in the video), whereas coalesce produces unevenly distributed partitions. In this example we used a very simple (almost negligible size) dataset, so we cannot notice the difference. But when we work on actual big data projects, this difference is very evident. Thanks for your comment.
@Sathish, sorry for the late reply. Your understanding is right. Repartition always creates evenly distributed partitions (as I explained in the video), whereas coalesce produces unevenly distributed partitions. In this example we used a very simple (almost negligible size) dataset, so we cannot notice the difference. But when we work on actual big data projects, this difference is very evident. Thanks for your comment.
@@rajasdataengineering7585 Thank you for your prompt response! Much appreciated.
I just watched the video and had the exact same doubt. But Raja Sir already provided the answer
Very informative. Way better than paid courses.
Thank you!
That was amazingly explained! You rock!
Glad it was helpful!
Very helpful content, thank you!
Glad it was helpful! You are welcome
Great explanation...🎉🎉🎉
Glad you liked it! Keep watching
This is the video I was searching for .. thanks a lot ❤
Thanks Arindam!
Very nice video, sir. It cleared all my basic doubts about partitioning. Hope to see a video on optimization approaches like cache, persist, and Z-order. Thanks again.
Thank you Vipin. Sure, will post videos on optimization concepts such as cache, persist, Z-order in Delta, etc.
Very detailed explanation, sir.
Thank you Varun
Great, Sir...
1. What is the maximum value that can be set to - maxPartitionBytes....
2. What parameters should be considered to decide the partition bytes and the repartition count?
Thank you, Sir...
Lots of information in this video, very useful for understanding how to customize partitioning.
Thank you
Glad it was helpful! You are welcome!
awesome tutorial
Thanks Ramandeep!
Hi sir, I was looking for a complete pyspark series with more emphasis on architecture and its components.
I am having a good learning time with your YouTube series on PySpark.
I was wondering if I can get the slides for this course which can help me in referring back quickly when attending interviews.
Really nice 👍
Thank you Kamal
At 5:30, there is a mention that Snappy and gzip are both not splittable. But Snappy is splittable and can have partitions.
Not all Snappy files are splittable. Snappy with Parquet/Avro is splittable, but Snappy with JSON is not.
We can't generalise that all Snappy files are splittable or non-splittable.
Wow amazing
Thank you Kamal
Thank you so much sir for making such great videos. I'm learning a lot of nuances and best practices for practical applications.😊🙏
Thank you for your comment!
Happy to hear that these videos are helpful to you.
well explained.Thanks
Glad it was helpful! Thanks
****************************** 1.Performance Tuning *****************************************
1. Performance Optimization | Repartition vs Coalesce
--Spark is known for its speed. Speed comes from parallel computing, and parallel computing comes from partitioning.
--Partitioning is the key to parallel processing.
--If we design the partitions well, performance automatically improves.
--Hence partitioning plays an important role in error handling, debugging and performance.
--While partitioning we must know:
1. Right size of partitions
--Scenario: 2 partitions of 1000 MB and 10 MB. The core processing the 10 MB partition finishes much faster and then remains idle, which is not good.
2. Right number of partitions
--Scenario: we have a 16-core executor, but only 10 partitions are created.
Then:
1. Out of 16 cores, 10 cores will each pick one partition. A partition cannot be shared among cores, so the remaining 6 cores stay idle. Hence the right number of partitions must be chosen, as 6 cores are idle here.
2. Choose at least 16 partitions, or a multiple of the cores available. In the 1st iteration all 16 cores pick 16 partitions, and in the 2nd iteration the 16 cores pick the next 16 partitions; hence no cores are idle.
spark.default.parallelism
spark.default.parallelism was introduced with RDDs, hence this property is applicable only to RDDs. The default value for this configuration is the total number of cores on all nodes in the cluster; in local mode it is the number of cores on your system. For RDDs, wider transformations like reduceByKey(), groupByKey() and join() trigger data shuffling.
In the cluster used here, the value is 8, so 8 partitions are created by default.
spark.sql.files.maxPartitionBytes
When data is read from external files, partitions are created based on this parameter. It sets the maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.
Default size is 128 MB.
The above 2 parameters are configurable depending on your needs.
DataFrame.repartition()
The pyspark.sql.DataFrame.repartition() method is used to increase or decrease the number of DataFrame partitions, either by a number of partitions or by one or more column names. This function takes 2 parameters, numPartitions and *cols; when one is specified, the other is optional. repartition() is a wider transformation that involves shuffling the data, hence it is considered an expensive operation.
Key Points
• repartition() is used to increase or decrease the number of partitions.
• repartition() creates even partitions when compared with coalesce().
• It is a wider transformation.
• It is an expensive operation as it involves data shuffle and consumes more resources.
• repartition() can take int or column names as parameter to define how to perform the partitions.
• If parameters are not specified, it uses the default number of partitions.
• As part of performance optimization, it is recommended to avoid this function unless it is really needed.
coalesce()
--Spark DataFrame coalesce() is used only to decrease the number of partitions.
--This is an optimized or improved version of repartition(): the movement of data across partitions is lower with coalesce().
--coalesce() does not require a full shuffle; it merges a few partitions, or shuffles data only from a few partitions, thus avoiding a full shuffle.
--Due to the partition merge, it produces unevenly sized partitions.
Hello sir, I have a small doubt: when we supply 3 separate files into a single df at 14:03, why is the number of partitions 3 when the default partition size is 128 MB, given that the size of the df containing the 3 files is a lot less than 128 MB?
I will post another video on this concept which will explain in detail
I see executors 1 and 2 in the picture before coalesce or repartition, but after the operation I see both of them labelled executor 1. Is the picture wrong, or does this operation reduce the number of executors as well?
Good catch. It's a pictorial mistake. Repartition and coalesce have nothing to do with the number of executors.
excellent
Thanks Vishal! Glad you liked it
QQ - Will the change to the default partition size apply at the cluster level, or only within the notebook? In case other jobs are running on the cluster, will they also be impacted by the change in settings?
Hello Raja Sir, a few days ago I gave an interview in which they asked: if we want to create 1 partition from multiple partitions, which method would you choose, coalesce or repartition? I answered coalesce, but they said we would use repartition. Is that correct?
Hi Suresh,
In this case, the number of partitions needs to be reduced. Coalesce and repartition can both be used to reduce the number of partitions, but choosing one of them depends heavily on the use case, so you should have asked the interviewer for more input to understand the use case better. If many transformations would be applied after resizing the partitions, repartition would be the better choice; otherwise coalesce is the better choice.
@@rajasdataengineering7585 thanks 🙏
Please make video on liquid clustering..
Sure will create soon
Raja Sir, very well explained with examples. I would like to know: in the pictures you have shown 2 executors for repartition and coalesce, but in the same pictures showing the output you labelled both as executor 1. Is it a mistake, or did I not understand properly? Could you please clarify? This difference is there in both slides. -- Thank you, Vydu
Yes, it is a mistake.
Helpful tips
Thank you Kamal
All credits to you sir
Thank you! Hope it helps you gain the knowledge.
In a real-time scenario, when will we use coalesce and when repartition?
Thank you so much, it's a wonderful explanation.
Thank you
18:43 - When you decreased the number of partitions from 20 to 2 using repartition, which is supposed to create evenly distributed data, why did 1 partition contain 8 records and the other only 2? Isn't this uneven data distribution?
One doubt...!
When we used repartition(2), we got unevenly distributed partitions, i.e. 8 records in the 1st partition and 2 in the other.
But repartition should give us evenly distributed partitions, right? Please help me understand.
Hi Vamsi, good question.
Data does get evenly distributed by repartition. Here we see some difference because of the small dataset; from Spark's point of view, 2 rows or 8 rows are almost the same. We can see the difference between repartition and coalesce while dealing with huge amounts of data, like millions or billions of rows.
@@rajasdataengineering7585 thank you for clarification..
@@rajasdataengineering7585 your videos are so good...
Thank you
@@rajasdataengineering7585 had the same question, thank you for explaining!
Good evening. I recently came across a question in a Capgemini client interview. Consider a scenario: a 2 GB file is distributed in Hadoop. After doing some transformations we got a dataframe with 10 partitions. By applying repartition(1), all the data sits in one partition; the dataframe size is 1.8 GB, but your data node size is only 1 GB. Will this 1.8 GB fit on the data node or not? If yes, how? If no, what error will be given?
Requesting you, sir, please tell me the answer to this question.
In the example, repartition produced uneven output for 2 partitions but coalesce produced an even result. Please explain?
It was a good session. Is there any indicator or keyword to set for increasing or decreasing partitions in repartition? Because with repartition(20), how will we know whether it increased or decreased? Only after execution will we come to know whether it increased or decreased.
In the code I see sc.parallelize(range(100), 1); where is the reference for sc?
In Databricks, the Spark context is implicit; there is no need to define it separately.
Hi sir, it was a great explanation and good to see the practical implementation. But my only question is: theoretically it was said that repartition will evenly distribute the data and coalesce will unevenly distribute it. When it was practically implemented, I saw the opposite result; coalesce produced evenly distributed values in two partitions but repartition didn't. Can you please check?
Sorry, just saw the comments below. Will try with larger datasets.
Please check with a larger dataset and you can see the difference.
Good afternoon sir.
Requesting you to answer this question, which I recently faced in an interview.
Consider that you have read a 1 GB file into a dataframe.
The maxPartitionBytes configuration is set to 128 MB.
You have applied repartition(4) or coalesce(4) on the dataframe; either method will decrease the number of partitions. If you apply repartition(4) or coalesce(4), the partition size grows beyond 128 MB, but maxPartitionBytes is configured to 128 MB. Does it throw an error or not? If it throws an error, what error will we get when we execute the program? If not, what is Spark's behaviour in this scenario?
Could you tell me the answer to this question, sir? Requesting you, please.
The configuration 'maxPartitionBytes' plays a role while ingesting data from an external system into Spark memory. Once data is loaded into Spark memory, the partition size can vary according to various transformations and has nothing to do with maxPartitionBytes.
So in this case, it won't throw any error. Coalesce would produce unevenly distributed partitions, whereas repartition would create evenly distributed partitions in this case.
Hope it clarifies your doubts.
Thanks for sharing your interview experience; others in this community can benefit from it.
@@rajasdataengineering7585 thank you very much sir. I understand. If possible please do a video on this question sir. So we get more understanding visually sir. If possible please do it sir 🙏
Sure Kalyan, will create a video on this requirement
👍
👍🏻
sc.defaultParallelism is for RDDs and will only work with RDDs? spark.sql.shuffle.partitions was introduced with DataFrames and only works with DataFrames?