Apache Spark Core - Deep Dive - Proper Optimization | Daniel Tomes | Databricks

  • Published on Jun 5, 2024
  • Optimizing Spark jobs through a true understanding of Spark core. Learn: What is a partition? What is the difference between read/shuffle/write partitions? How to increase parallelism and decrease output files? Where does shuffle data go between stages? What is the "right" size for your Spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
    About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
  • Science & Technology

Comments • 81

  • @xinyuan6649
    @xinyuan6649 2 years ago +32

    This is by far the most informative and in-depth talk about Spark job optimization I've found on YouTube. Before this, the optimization I'd been doing was mostly blind trial & error. Thank you so much Daniel; it's amazing to see someone break a (sometimes) overwhelming task down into basic Spark concepts and apply deductive & inductive analysis.

  • @manjy5927
    @manjy5927 1 day ago

    Thanks, your video helps me keep my job

  • @danieltomes8566
    @danieltomes8566 4 years ago +61

    Hello folks, thanks for all the support. Sorry for the delay on the ebook; it's still coming, it was just delayed. I will share it here as soon as it's available. I'm hoping for Q1 this year. :)

    • @GenerativeAI-Guru
      @GenerativeAI-Guru 4 years ago

      Thanks!

    • @ajaypratap4025
      @ajaypratap4025 4 years ago

      Thanks Daniel, great talk. Please share the ebook once it's completed.

    • @moha081
      @moha081 4 years ago

      Hello Daniel,
      When I checked the Spark documentation, I found that cache is equal to persist with the MEMORY_ONLY option. Is rdd.cache different from df.cache?
      spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence
      Thanks
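
      For reference, a minimal PySpark sketch of the difference, assuming Spark 2.x/3.x defaults (per the linked docs, RDD cache() is MEMORY_ONLY, while DataFrame cache() uses MEMORY_AND_DISK, deserialized in recent versions); the DataFrame here is a throwaway example:

      from pyspark import StorageLevel
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("cache-demo").getOrCreate()
      df = spark.range(1_000_000)              # throwaway example DataFrame

      rdd = df.rdd
      rdd.cache()                              # RDD API: equivalent to rdd.persist(StorageLevel.MEMORY_ONLY)
      df.cache()                               # DataFrame API: persists at MEMORY_AND_DISK, not MEMORY_ONLY

      df2 = spark.range(1_000_000)
      df2.persist(StorageLevel.MEMORY_ONLY)    # explicit MEMORY_ONLY for a DataFrame, if that is what you want

      print(rdd.getStorageLevel())             # inspect both levels to see the difference
      print(df.storageLevel)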

    • @luxsasha
      @luxsasha 4 years ago

      Any update, @daniel?

    • @SpiritOfIndiaaa
      @SpiritOfIndiaaa 3 years ago

      Nice 👍, any updates on the ebook?

  • @syedshahasad9551
    @syedshahasad9551 3 years ago +2

    Excellent explanations.
    Cleared up so many of my misconceptions.
    Thanks man!!!!

  • @aleixfalguerascasals3329
    @aleixfalguerascasals3329 1 year ago

    The best talk about Spark optimizations on YT by far, thanks man!

  • @SpiritOfIndiaaa
    @SpiritOfIndiaaa 3 years ago

    Thank you so much for explaining salt addition clearly.

  • @sureshsindhwani6317
    @sureshsindhwani6317 3 years ago +1

    Super talk Daniel and great insights, still waiting for the ebook though :)

  • @leoezhil5304
    @leoezhil5304 6 months ago

    Really very good session Dan.
    It’s my pleasure to work with you.

  • @saravanannagarajan1169
    @saravanannagarajan1169 2 years ago

    Good job, Daniel Tomes. It helps a lot.

  • @kyleligon2472
    @kyleligon2472 5 years ago +5

    Great talk! Really learned a lot, looking forward to the book!

    • @TheRags080484
      @TheRags080484 4 years ago

      What is the name of the book? When will it come out?

  • @lbam28
    @lbam28 4 years ago +1

    Great talk, with a lot of takeaways!! Are there any references to the notebooks and datasets so I can recreate some of the optimizations?

  • @veerasekharadasandam139
    @veerasekharadasandam139 2 years ago

    good insights, really helpful.

  • @manishmittal595
    @manishmittal595 3 years ago

    Hi... Great presentation for understanding Spark optimizations. Are there any presentation slides to go through? In the video it's a little difficult to read those numbers...

  • @michalsankot
    @michalsankot 3 years ago +1

    Excellent talk Daniel 👍 I wish I had seen it when I started with Spark :-) How's it looking with the mentioned e-book?

  • @thomsondcruz5456
    @thomsondcruz5456 2 years ago +3

    4:45 1 Task = 1 Core can be changed using the property spark.task.cpus. The default is 1 Task = 1 Core.
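
    A quick sketch of that setting (spark.task.cpus; the executor sizing below is illustrative): with 8 cores per executor and 2 CPUs per task, each executor runs at most 4 concurrent tasks.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("task-cpus-demo")
             .config("spark.executor.cores", "8")   # illustrative executor size
             .config("spark.task.cpus", "2")        # default is 1, i.e. 1 task = 1 core
             .getOrCreate())

    print(spark.conf.get("spark.task.cpus"))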

  • @SandeepPatel-wt7ye
    @SandeepPatel-wt7ye 2 years ago

    Thanks, Daniel, great talk. Please share the ebook link.

  • @hseham100
    @hseham100 4 years ago

    great one

  • @user-ff1bm9jh2v
    @user-ff1bm9jh2v 3 years ago +1

    50:49 What is the purpose of adding a constant to each row? The key will not change; it will still be computed on 1 executor. The salt column value should be different for different rows to split a large key.
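
    Right; for the skew fix to work the salt has to vary per row on the large side. A hedged sketch of salting a skewed join (the 'facts'/'dims' frames, the 'join_key' column and the bucket count are made up, and an active SparkSession is assumed):

    from pyspark.sql import functions as F

    SALT_BUCKETS = 16  # illustrative

    # Random salt on the skewed (large) side, different per row.
    facts_salted = facts.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

    # Replicate the small side once per salt value so every (join_key, salt) pair can match.
    dims_salted = dims.withColumn(
        "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
    )

    joined = facts_salted.join(dims_salted, on=["join_key", "salt"]).drop("salt")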

  • @dagmawimengistu4474
    @dagmawimengistu4474 3 years ago

    How did you come up with the 16 MB maxPartitionBytes? Is there a general formula for it?
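
    Spark has no built-in formula for this; the heuristic in the talk is to size read partitions so the input splits evenly over the cluster's cores, remembering that spark.sql.files.maxPartitionBytes applies to on-disk (often compressed) size, so the on-disk target can be much smaller than the in-memory target. A hedged back-of-the-envelope sketch (all numbers illustrative; an active SparkSession named spark is assumed):

    input_bytes  = 54 * 1024**3        # total on-disk input, e.g. 54 GB
    total_cores  = 96
    target_waves = 5                   # how many full waves of tasks you want

    partitions = total_cores * target_waves
    max_partition_bytes = input_bytes // partitions

    spark.conf.set("spark.sql.files.maxPartitionBytes", str(max_partition_bytes))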

  • @Taczan1
    @Taczan1 2 years ago +1

    Great talk! Many Thanks.
    Btw, where is the book, Daniel? :)

  • @DCameronMauch
      @DCameronMauch 4 years ago

    He mentioned that the slide deck would be available. Does anyone know where to find it?

  • @vchandm23
    @vchandm23 1 year ago

    This video is pure gold.

  • @JaX0rton
    @JaX0rton 3 years ago +1

    Can we get a link to the slides? There are tons of small details on the slides that will be easier to go through if we have the slides rather than pausing the video every time. :)

  • @entertainmentvlogs9634
    @entertainmentvlogs9634 2 years ago

    How do you set spark.sql.shuffle.partitions with a variable instead of a constant? If the shuffle input data size is small, the job should automatically choose fewer shuffle partitions, and if a stage's shuffle input is larger it should programmatically determine the correct partition count, rather than being given a constant value.
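
    On Spark 3.x, adaptive query execution does this coalescing for you; on 2.x you can approximate it by estimating the shuffle input and deriving the count before the wide transformation. A hedged sketch (the target size, the size estimate and 'big_df' are hypothetical; an active SparkSession named spark is assumed):

    TARGET_PARTITION_BYTES = 128 * 1024 * 1024   # ~128 MB per shuffle partition, illustrative

    est_shuffle_bytes = 40 * 1024**3             # rough estimate of the data about to shuffle, hypothetical
    num_partitions = max(1, est_shuffle_bytes // TARGET_PARTITION_BYTES)

    spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))
    result = big_df.groupBy("key").count()       # the wide transformation picks up the new setting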

  • @AshikaUmanga
    @AshikaUmanga 4 years ago +7

    In the lazy-loading example, he filtered the years 2000-2001. What if the calculation has to be done for all years? You can't use a filter in that case, right?

    • @joshuahendinata3594
      @joshuahendinata3594 4 years ago +1

      I believe he is just using the filter to reduce the number of rows to join. I think you must have knowledge of the input data and know that outside a particular range the data won't have a match in the join.

    • @JaX0rton
      @JaX0rton 3 years ago

      +1. I think the idea is to filter as far upstream as possible. Don't do a cartesian product and then a filter if you can filter ahead of time.
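
      A minimal sketch of that idea (the paths and column names are hypothetical); with Parquet, predicate pushdown and partition pruning mean the filtered-out years are never read at all:

      from pyspark.sql import functions as F

      perf = spark.read.parquet("/data/performance")   # hypothetical, partitioned by 'year'
      acct = spark.read.parquet("/data/accounts")      # hypothetical

      # Filter as far upstream as possible: prune before the join, not after.
      perf_recent = perf.filter(F.col("year").between(2000, 2001))
      joined = perf_recent.join(acct, on="loan_id", how="inner")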

  • @SpiritOfIndiaaa
    @SpiritOfIndiaaa 3 years ago

    @1:00:28 Line 22, i.e. save(histPerfPath_1y): does it work if I run it on a cluster to save to an HDFS path? If I save inside a foreach and run it in yarn-cluster mode, it fails to save with an error that the HDFS temp file can't be found. How do I solve this kind of issue?

  • @TheRags080484
    @TheRags080484 4 years ago +3

    Is the e-book available?

  • @programminginterviewsprepa7710
    @programminginterviewsprepa7710 1 year ago +1

    If a system is built such that you have to tweak it and understand it this much to get good performance, I would say it should be redesigned.

  • @SpiritOfIndiaaa
    @SpiritOfIndiaaa 3 years ago +2

    If you add a "salt" column in a groupBy, it would give wrong results, right, if we need the actual groupBy results?

    • @rushabhgujarathi1254
      @rushabhgujarathi1254 3 years ago +2

      Yes, you are right; I had the same question initially. This can be solved by running an additional groupBy on the obtained dataset, which will run fast. We use two groupBys to avoid the skew.
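
      A hedged sketch of that two-stage aggregation ('events', 'key' and the bucket count are made up); the small second groupBy restores the correct per-key totals:

      from pyspark.sql import functions as F

      SALT_BUCKETS = 32  # illustrative

      salted = events.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

      # Stage 1: spread each hot key over (key, salt) so no single task owns a whole key.
      partial = salted.groupBy("key", "salt").agg(F.count("*").alias("cnt"))

      # Stage 2: tiny shuffle that folds the partials back to one row per key.
      final = partial.groupBy("key").agg(F.sum("cnt").alias("cnt"))

      Note this only works directly for associative aggregations (count, sum, min, max); an average needs the sum and count carried through separately.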

  • @SuperFatafati
    @SuperFatafati 2 years ago

    What command do we use to use all 96 cores while writing instead of only 10?
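
    There isn't a single command for it, but the usual approach (a hedged sketch; 'df' and the output path are hypothetical) is to repartition immediately before the write so the final stage has one task per core, then compact afterwards if you need fewer output files:

    # 96 write tasks => all 96 cores busy, 96 output files.
    df.repartition(96).write.mode("overwrite").parquet("/out/table")

    # If you need fewer, larger files instead, you trade write parallelism for file count:
    # df.repartition(10).write.mode("overwrite").parquet("/out/table")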

  • @LaxmanKumarMunigala
    @LaxmanKumarMunigala 4 years ago +1

    Is there a GitHub repo or some other place with the data and code used for this exercise?

  • @129ravi
    @129ravi 4 years ago +3

    Is the ebook available?

  • @SpiritOfIndiaaa
    @SpiritOfIndiaaa 3 years ago

    @45 min: why does the broadcast have 4 times 12 = 48? Shouldn't it be 3 times 12 = 36, since we have 3 executors?

    • @rushabhgujarathi1254
      @rushabhgujarathi1254 3 years ago

      The memory will reduce to 36 GB once GC kicks in; until then the data is still lying in the memory of all the executors, which is why it shows 48 GB.

  • @aymiraydinli5655
    @aymiraydinli5655 2 months ago

    I am curious whether setting partition sizes that small would cause a small-file issue, or maybe I am missing something. Can someone please answer?

  • @ajithkannan522
    @ajithkannan522 1 year ago

    Awesome awesome awesome

  • @chenlin6683
    @chenlin6683 5 years ago +2

    Could you share the slides of this topic?

  • @nebimertaydin3187
    @nebimertaydin3187 10 months ago

    What's the link for the range join optimization reference?

  • @khaledarja9239
    @khaledarja9239 4 years ago +1

    Is lazy loading just a matter of adding a filter?

  • @vinayakmishra1837
    @vinayakmishra1837 3 years ago

    God Level

  • @SpiritOfIndiaaa
    @SpiritOfIndiaaa 3 years ago

    What do you mean by saying "have an array of table names and parallelize it"? What do you mean by parallelize here?

    • @khuatdinh
      @khuatdinh 1 year ago

      He meant something like running "multiple threads" of Spark jobs over the list of tables.
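
      In other words, driver-side concurrency: kick off one table's Spark jobs per thread so independent jobs run on the cluster at the same time. A hedged PySpark sketch (the table names and output paths are made up; an active SparkSession named spark is assumed):

      from concurrent.futures import ThreadPoolExecutor

      tables = ["sales", "customers", "orders"]   # hypothetical

      def process(table: str) -> None:
          # Each call launches its own Spark jobs; the scheduler interleaves them.
          spark.table(table).write.mode("overwrite").parquet(f"/out/{table}")

      with ThreadPoolExecutor(max_workers=4) as pool:
          list(pool.map(process, tables))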

  • @AB-xg1qb
    @AB-xg1qb 4 years ago +1

    It is very helpful! Can someone share the ebook?

  • @Prashanth-yj6qx
    @Prashanth-yj6qx 9 days ago

    Can anyone tell me why he reduced the target size to 100 MB from 200 MB?

  • @raksadi8465
    @raksadi8465 3 years ago

    In my Spark 2.4.3 job, after all my transformations, computations and joins, I write the final dataframe to S3 in Parquet format.
    But irrespective of my core count, the save action takes a fixed amount of time:
    for core counts of 8, 16 and 24, the write consistently takes 8 minutes.
    Because of this, my solution is not scalable.
    How should I make it scalable so that the overall job execution time is proportional to the cores used?

  • @SpiritOfIndiaaa
    @SpiritOfIndiaaa 3 years ago

    @22 min: where did you get the Stage 21 shuffle input size?

  • @saiyijinprince
    @saiyijinprince 3 years ago +3

    Why is the first example a valid comparison? You reduced the size of the data you are working with, so obviously it will run faster. What if you actually need to process all years instead of just two?

    • @mdtausif3480
      @mdtausif3480 3 years ago

      I have the same question... I didn't get how reducing the number of input files is an optimization.

  • @loveleshchawla4406
    @loveleshchawla4406 3 years ago +5

    Slides - www2.slideshare.net/databricks/apache-spark-coredeep-diveproper-optimization?qid=68cf4b27-c5a2-48a1-96a1-faa3b411ebb4&v=&b=&from_search=1

  • @Oscar-pj5cb
    @Oscar-pj5cb 1 year ago

    Could you share the slides?

  • @michailanastasopoulos1084
    @michailanastasopoulos1084 4 years ago +6

    Why are 540 partitions not good? It is explained at 24:42 but I didn't quite get it.

    • @michailanastasopoulos1084
      @michailanastasopoulos1084 4 years ago +23

      OK, got it now: each core processes one 100 MB partition. We have 96 cores that need to process a total of 54 GB. In a given wave, all 96 cores can process at most 96 x 100 = 9600 MB, so after 5 waves the cluster has processed 9600 MB x 5 = 48000 MB. For the last wave it only needs to process the remaining 6000 MB, i.e. 6000 / 100 = 60 partitions. Those 60 partitions occupy 60 cores, which is 62.5% of the cluster; the remaining 36 cores (37.5% of the cluster) sit idle in the last wave. The story looks different with 480 partitions of 112.5 MB each: that gives 10800 MB per wave, and all the data is processed after 5 waves at 100% CPU utilization (see the quick calculation sketched after this thread).

    • @joshuahendinata3594
      @joshuahendinata3594 4 years ago +1

      @@michailanastasopoulos1084 Thank you so much. A much needed explanation!

    • @ejmsp
      @ejmsp 3 years ago

      @@michailanastasopoulos1084 Thanks, I had the same doubt.

    • @bikashpatra119
      @bikashpatra119 3 years ago

      @@michailanastasopoulos1084 Can you please help me understand why the partition size was set to 100 MB?

    • @amontesi
      @amontesi 1 year ago

      @@bikashpatra119 My guess is that it's simply because it is below the notional upper bound of 150-200 MB per partition.
      There doesn't seem to be a formula that returns the amount of spill for a given set of inputs (i.e. shuffle input size and the list of tasks in the stage). But since the general logic is that smaller target partition sizes reduce spill, even if we cannot predict how much spill will result, we can fast-track the spill reduction by simply halving the notional 200 MB/partition upper bound for the target partition size.
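
      The core-count arithmetic from the thread above, as a quick calculation (same illustrative numbers: 54 GB of shuffle input, 96 cores):

      total_mb = 54 * 1000
      cores    = 96

      # Naive 100 MB/partition target => 540 partitions => a ragged final wave.
      parts_100 = total_mb // 100                      # 540
      waves_100 = -(-parts_100 // cores)               # ceil(540 / 96) = 6 waves
      last_wave = parts_100 - (waves_100 - 1) * cores  # 60 tasks, so 36 cores idle

      # Instead, make the partition count a multiple of the core count.
      waves   = 5
      parts   = cores * waves                          # 480 partitions
      size_mb = total_mb / parts                       # 112.5 MB each, full utilization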

  • @gardnmi
    @gardnmi 6 months ago

    It's almost 2024 and the default of 200 shuffle partitions is still wreaking havoc on pipelines.
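
    For what it's worth, on Spark 3.x adaptive query execution will coalesce those post-shuffle partitions at runtime; a hedged config sketch (the advisory size is illustrative; an active SparkSession named spark is assumed):

    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # AQE merges small post-shuffle partitions toward this advisory target size:
    spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")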

  • @vliopard
    @vliopard 4 months ago

    I was trying to load 600 million rows into a pandas dataframe from SQL Server. It was taking too long and then hit an OOM error. I thought PySpark would solve that problem, but after 42 minutes of watching this video I see that it's better to use a cluster with more RAM and beef up the SQL Server processor. Most of the requirements for setting up PySpark aren't known in my environment, so PySpark isn't much use when you're working with data you don't know the details of.

  • @raviiit6415
    @raviiit6415 10 months ago

    watched it

  • @AnkushSingh-hi6gj
    @AnkushSingh-hi6gj 3 years ago

    I can never do all of it

    • @harshthakur3248
      @harshthakur3248 10 months ago

      Checking if you can 😅

  • @Gerald-iz7mv
    @Gerald-iz7mv 1 year ago

    Are all pandas UDFs vectorized?
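
    Yes: a pandas UDF receives whole Arrow batches as pandas Series rather than one row at a time, unlike a plain Python UDF. A minimal sketch (Spark 3.x type-hint style; the column and tax rate are made up; an active SparkSession named spark is assumed):

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def plus_tax(price: pd.Series) -> pd.Series:
        # Runs on a whole batch (pandas Series) at once.
        return price * 1.07

    df = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["id", "price"])
    df.select(plus_tax("price").alias("with_tax")).show()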