Suppose you have two large DataFrames. How would you optimize that join? And why are you using an RDD here, when it is so much slower than a DataFrame?
small2 is not defined. Also, why is the shuffle cost of partitioning the two RDDs separately lower than the shuffle cost of joining them directly? They are basically doing the same thing: moving data with the same join key to the same executor.
A shuffle during the join, or a repartition before the join: you are saying the second one is better, right? What is the difference? You have not explained why it is better. Either way someone has to move the data (either the join shuffles, or we repartition), so please tell us why this approach wins.
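For anyone else stuck on this, here is a minimal sketch of the difference, with toy data and names of my own choosing (not from the video). The point is not that the repartition avoids the first shuffle; it is that the shuffle is paid once, and the persisted, co-partitioned RDDs can then be joined (and re-joined) without any further shuffle:

```scala
import org.apache.spark.HashPartitioner

// Toy pair RDDs keyed on the join column (stand-ins for the video's data).
val bigRdd   = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c"))
val smallRdd = sc.parallelize(Seq(1 -> "x", 2 -> "y"))

val part = new HashPartitioner(4)

// Pay the shuffle once, then keep the partitioned layout in memory.
val bigP   = bigRdd.partitionBy(part).persist()
val smallP = smallRdd.partitionBy(part).persist()

// Both sides now share the same partitioner, so the join itself is
// a narrow operation: no further shuffle, here or in later joins on
// the same key.
val joined = bigP.join(smallP)
joined.collect().foreach(println)
```

So for a single one-off join the two approaches really do cost about the same; the win only shows up when the partitioned RDDs are reused.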
Your example is not up to the mark. What you describe in the lecture is hard to follow; it feels like the video was made only for the sake of making a video. I did not get your point about how the join happens and what goes on underneath. Please explain it in a much more understandable manner.
That's what I was looking for. It's a great help, Viresh.
I'm really happy with your deep dive into Spark. Thank you.
Thanks Viresh. RDDs are used here; how would you do the same with a Dataset/DataFrame? And where did "small2" come from?
What's the benefit of persisting the 2 RDDs?
Any suggestions for dataframes?
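A sketch of the DataFrame-side equivalent, with toy data and column names of my own (the video does not show this): repartition both sides on the join key with the same partition count and persist them; the planner should then be able to skip the extra exchange before the join.

```scala
import org.apache.spark.sql.functions.col
import spark.implicits._

// Toy DataFrames standing in for the real tables.
val bigDf   = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v1")
val smallDf = Seq((1, "x"), (2, "y")).toDF("id", "v2")

// Hash-partition both sides on the join key, same partition count.
val bigP   = bigDf.repartition(8, col("id")).persist()
val smallP = smallDf.repartition(8, col("id")).persist()

// With matching partitioning on both children, the sort-merge join
// should not need another shuffle; check the plan to confirm.
val joined = bigP.join(smallP, "id")
joined.explain()
```

For tables that live on disk and are joined across many jobs, writing them with bucketBy on the join key achieves the same effect persistently.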
I feel the title is misleading: repartitioning the two RDDs involves a shuffle as well.
I have been asked in an interview: if I have two tables with the same volume of data, but one has 10 columns and the other has 3, how do I optimize that join?
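One common answer to that interview question, sketched with made-up column names: prune the wide table down to just the columns the join and downstream steps need, so the shuffle moves far less data per row. Catalyst usually pushes this projection down automatically for DataFrames, but being explicit costs nothing, and with RDDs there is no optimizer to do it for you.

```scala
import spark.implicits._

// Made-up stand-ins: imagine wideDf with 10 columns, narrowDf with 3.
val wideDf   = Seq((1, "a", 10.0), (2, "b", 20.0)).toDF("id", "name", "amount")
val narrowDf = Seq((1, "NY"), (2, "SF")).toDF("id", "city")

// Keep only the columns that are actually needed before the join.
val wideSlim = wideDf.select("id", "amount")

val joined = wideSlim.join(narrowDf, "id")
joined.show()
```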
Really nice, thanks bro. In line 14, should it be "small.partitioner.get" instead of "small2.partitioner.get"? And why is shuffle.partitions set to only 2?
Otherwise the remaining 198 partitions would be empty.
@TechWithViresh do you mean "otherwise" or "in other words"? Do you want to keep 198 partitions empty?
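My reading of that setting, for anyone else confused: nobody wants 198 empty partitions. With the tiny demo dataset, the default of 200 shuffle partitions would leave 198 tasks with no data at all, so setting it to 2 just avoids scheduling empty tasks:

```scala
// Default is 200; on a two-key demo dataset most tasks would be empty.
spark.conf.set("spark.sql.shuffle.partitions", "2")
```

Note this config only governs DataFrame/SQL shuffles; the RDD path takes its partition count from the partitioner you pass in.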
I have been following the series and it is pretty good, but this video is not at all clear. You should make another one on the same question.
What's the book/picture in the video?
Even with repartitioning we have to move data to different partitions, causing a shuffle, isn't it?
Why are you converting the DataFrame to an RDD? That is very bad practice in terms of performance.
The concept is really worth testing. The code is incomplete in places; I took time to fill the gaps. The last line, display(): will it work in plain Scala Spark? 🙄 This code will run fine on Azure Databricks, though.
The partitioner is coming back as None for largeRDD at line 14, so .get fails.
Where did small2 come from? There are typos; can you please update it?
Getting an error at line 14:
error: value partitioner is not a member of org.apache.spark.sql.DataFrame
Kindly suggest.
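That compile error and the None mentioned above likely share a root cause: partitioner is a member of RDD, not DataFrame, and even on an RDD it is None unless the RDD was explicitly partitioned. A defensive sketch with my own variable names:

```scala
import spark.implicits._

val someDf = Seq((1, "a"), (2, "b")).toDF("id", "v")

// `partitioner` lives on RDD, not DataFrame: drop to the RDD first.
val maybePart = someDf.rdd.partitioner   // Option[Partitioner], often None

// Never call .get blindly on the Option; fall back to the count.
val numParts = maybePart match {
  case Some(p) => p.numPartitions
  case None    => someDf.rdd.getNumPartitions
}
```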
Very nice content on this channel, thanks for that. Question: can range partitioning work here?
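My understanding, sketched with toy data of my own: yes, a RangePartitioner also sends equal keys to the same partition, so two RDDs sharing the same RangePartitioner instance can be equi-joined without a further shuffle, just as with HashPartitioner. Hash is still the usual choice for plain equi-joins; range partitioning mainly pays off when you also want sorted output or range queries.

```scala
import org.apache.spark.RangePartitioner

val bigRdd   = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c"))
val smallRdd = sc.parallelize(Seq(1 -> "x", 3 -> "z"))

// Range bounds are sampled from the RDD the partitioner is built on;
// both sides must use the SAME instance to stay co-partitioned.
val part = new RangePartitioner(2, bigRdd)

val bigP   = bigRdd.partitionBy(part).persist()
val smallP = smallRdd.partitionBy(part).persist()

val joined = bigP.join(smallP)   // narrow join, as with HashPartitioner
joined.collect().foreach(println)
```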
How to connect with you?
TechWithViresh@gmail.com
The video recommendations at the end are blocking the content.