Incredibly useful. I appreciate the way it is explained. I suggest you pick a use case and resolve a long-running problem by changing the cluster configuration.
The way you explain complex stuff is incredible. I am a data engineer with more than 8 years of experience, and I totally loved your content.
Extremely helpful, and great to touch on different aspects of Databricks.
Thank you - very useful as I prep for the ADV DBX DE cert!
Very useful one! Thanks for making this. What would also be interesting to see is how you go about fixing a performance issue once you've found its cause via the Spark UI. Like the skew issue you mentioned, how do we fix it? Maybe a video on Spark performance tuning? :)
You can often eliminate skew by repartitioning. If, however, your operation is based on grouped data and the groups are skewed then you might need to rethink your approach or resize your nodes to fit the largest partition without spill.
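One common way to "rethink your approach" when a single group is skewed is key salting. The sketch below is a plain-Python simulation rather than Spark code, just to show the effect on partition sizes. The customer names, the partition count, the salt factor and the use of crc32 as a stand-in for Spark's hash partitioner are all assumptions made for illustration.

```python
from collections import Counter
import zlib


def partition_sizes(keys, num_partitions):
    """Count rows per partition when rows are hash-partitioned by key."""
    sizes = Counter()
    for key in keys:
        # crc32 is a stable stand-in for a hash partitioner (an assumption).
        sizes[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return [sizes.get(p, 0) for p in range(num_partitions)]


def salt_keys(keys, salt_buckets):
    """Append a rotating salt so one hot key is split into several sub-keys."""
    return [f"{key}#{i % salt_buckets}" for i, key in enumerate(keys)]


# Hypothetical skewed data: one customer owns 90% of the rows.
rows = ["big_customer"] * 9000 + [f"customer_{i}" for i in range(1000)]

before = partition_sizes(rows, num_partitions=8)
after = partition_sizes(salt_keys(rows, salt_buckets=16), num_partitions=8)

print("largest partition before salting:", max(before))
print("largest partition after salting: ", max(after))
```

With a salted key, a grouped aggregation has to run in two steps: aggregate by the salted key first, then aggregate those partial results by the original key.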
I skipped over the troubleshooting for that one as I go through the same example in the AQE demo. Adaptive Query Execution targets that exact problem, and you can use the Spark UI to see where it might be happening - th-cam.com/video/jlr8_RpAGuU/w-d-xo.html
But yes, in general there's a Spark performance tuning video/session I should probably write at some point!
@phy2sll Yep, absolutely - though in 7.0 we've got AQE as an alternative path to fix some of that. Doesn't catch everything, which is when you'd drop back to the fixes you mentioned!
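For anyone wanting to try the AQE route, here is a minimal sketch of the session settings involved. The three setting names are standard Spark 3.x configuration keys. The `apply_settings` helper is hypothetical, and it assumes `spark` is an existing SparkSession (as in a Databricks notebook).

```python
# Standard Spark 3.x AQE setting names; turning all three on is an
# illustrative choice, not a recommendation for every workload.
aqe_settings = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_settings(spark, settings):
    """Hypothetical helper: apply each setting through spark.conf.set."""
    for key, value in settings.items():
        spark.conf.set(key, value)
```

In a notebook this would be called as `apply_settings(spark, aqe_settings)` before running the query you want to inspect in the Spark UI.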
Excellent video! I am in the middle of optimizing a script for a client, and I have seen a lot of videos that show the UI first thing, but nobody talks about exactly how to take advantage of this resource. Thanks for sharing, and I'm subscribing!
Great video Simon - thanks!
Thank you for such a simple and powerful explanation.
You are too good! Tons of important info. Thx for sharing!
This video is super helpful, thank you very much! :)
I would be very interested in the topic you mentioned briefly at the beginning about the JVM. Do you explain this somewhere in more detail? Also, how does PySpark interact with the JVM, and how does Scala come into play here?
Fantastic explanation. Thanks a lot!
Thanks for introducing Ganglia. Could you also make a video on how to understand it and make more sense of the graphs and data it's showing, please? That would be super useful.
I love you Sir... Please keep on adding such videos
I was waiting for this !!!!!!
Finally ! Thanks 😊
Super helpful!! Thank you so much!!!
Really, really appreciate this.
I was hoping you were going to end by showing how we might use Ganglia to assess the appropriate cluster size for a particular job.
Yeah, definitely a lot more to dive into around Ganglia & specific performance tuning. I need to find some time to make some specific troubleshooting examples for that video - when I got close to 30 mins I thought I should probably stop for this one!
Simon
Hi! Do you know why executors on the Executors tab sometimes turn blue?
Thanks for the great video! Please do make a Ganglia-focused one when you have time :)
How should we decide, using the UI, whether increasing the number of nodes (cores) or increasing the SKU (memory) of each node would give more performance benefit? Thank you! :)
Thanks for this intro. Get ready Spark jobs, you're gonna be examined
How can we choose the right number of worker nodes for a job? My job used the max number of worker nodes when I changed it from 75 to 90, but the job ran fine both times and I did not see any change in performance.
While the job is running you can check the number of tasks against the number of slots - if your job has fewer tasks than there are slots, then adding more workers won't change the processing time. If you want to utilise more of the cluster, you can repartition the dataframe to a higher partition count to spread it across more RDD blocks. That's a careful balance, as it introduces a new shuffle which may cause more performance problems than the increased parallelism solves!
Hopefully that gives you something to look at, even if it's not that helpful!
Simon
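The tasks-versus-slots check can be worked through as simple arithmetic. This is a rough sketch: the 4 cores per worker is an assumed figure, and 200 tasks is used because that is Spark's default value for `spark.sql.shuffle.partitions`. The real numbers for a job are the ones shown in the Spark UI.

```python
import math


def task_waves(num_tasks, num_workers, cores_per_worker):
    """Return (slots, waves, idle_slots) for one stage.

    slots      = total cores available to run tasks in parallel
    waves      = how many rounds are needed to run every task
    idle_slots = slots left with nothing to do when tasks < slots
    """
    slots = num_workers * cores_per_worker
    waves = math.ceil(num_tasks / slots)
    idle_slots = max(slots - num_tasks, 0)
    return slots, waves, idle_slots


# Assumed cluster shape: 4 cores per worker, a stage with 200 tasks.
for workers in (75, 90):
    slots, waves, idle = task_waves(200, workers, 4)
    print(f"{workers} workers: {slots} slots, {waves} wave(s), {idle} idle slots")
```

Under these assumptions both cluster sizes finish the stage in a single wave, so the extra 15 workers only add idle slots, which would match seeing no change in run time.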
@AdvancingAnalytics Are you suggesting using repartition or not? My notebook takes over an hour.
Thank you, it is very clear. Would you be willing to share the code used during the explanation?
I generally don't have time to tidy the code up and make it separately runnable - maybe in future :)
For this one, I grabbed the AQE demo from the Databricks blog, it's good to force skew, small partitions etc to use as diagnosis practice:
docs.databricks.com/_static/notebooks/aqe-demo.html
Thank you very much for this uplifting video :). I was used to working with the Cloudera interface, so I'm wondering where the application name is. Have we lost it?
💡🔐🔥
Thanks a lot for this.
You are awesome!!!
Really wonderful stuff, Simon!
I was wondering how Spark/Databricks handles keys. Does Databricks get data that already has keys from the upstream source, or do you know how a dimension is created with generated keys, like a typical merge-dimension procedure would do in SQL Server?
Hey, thanks for watching! Like any analytics tool, you get business keys from upstream/source systems, then usually need to build new/composite keys. It's fairly common to create a hash over business keys. If you need to generate new "identity" columns there are a few patterns to follow - the monotonically_increasing_id() function is great for adding new unique values on top of a dataframe, and you can simply add the current max value of the dim if you are trying to create "surrogate key" style functionality.
Quite often it's easier to stick with hash values and deal with the SCD/latest version problems downstream.
Simon
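The two key patterns from the reply above can be sketched in plain Python. This is an illustration of the idea rather than Spark code: the row contents, column names, separator and starting max key of 500 are made up, and SHA-256 is just one reasonable choice of hash. Note that in Spark, monotonically_increasing_id() returns values that are unique but not consecutive, whereas this sketch hands out consecutive integers.

```python
import hashlib


def hash_key(*business_keys, sep="||"):
    """Deterministic key built from one or more business key values."""
    joined = sep.join(str(k) for k in business_keys)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def assign_surrogate_keys(new_rows, current_max_key):
    """Give each new row the next integer after the dimension's current max."""
    return [
        {**row, "surrogate_key": current_max_key + offset}
        for offset, row in enumerate(new_rows, start=1)
    ]


# Hypothetical new dimension rows arriving from a source system.
new_customers = [
    {"source_system": "crm", "customer_no": "C-1001"},
    {"source_system": "crm", "customer_no": "C-1002"},
]

keyed = assign_surrogate_keys(new_customers, current_max_key=500)
for row in keyed:
    row["hash_key"] = hash_key(row["source_system"], row["customer_no"])
    print(row["surrogate_key"], row["hash_key"][:12])
```

The hash key can be recomputed independently by any load because it depends only on the business keys, while the integer key needs the dimension's current max to be read first.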
@AdvancingAnalytics Ah, I see, thanks for the insights!