Everyday I'm Shuffling - Tips for Writing Better Apache Spark Programs

  • Published 22 Jan 2025

Comments • 20

  • @vasukisubbaramu4656
    @vasukisubbaramu4656 9 years ago

    Thanks Holden. Brilliant lecture. I am looking at some inefficiencies with Spark joins and this video helped me solve some of them. Please keep it coming.

    • @youdeepujain
      @youdeepujain 9 years ago

      Vasuki Subbaramu Can you share some of those problems and solutions? I am doing joins between large RDDs and they are very slow. Throwing more executors at them is not speeding things up.

    • @sahebbhattu
      @sahebbhattu 8 years ago

      Please let me know if you have implemented anything for your large joins.

  • @stackologycentral
    @stackologycentral 9 years ago

    Thanks Holden and Vida. Awesome tips.

  • @igor.berman
    @igor.berman 9 years ago +2

    Does anybody know how to filter the big RDD before joining the BigRdd with the MediumRdd (described as the better solution at 12:57)?

    • @sahebbhattu
      @sahebbhattu 8 years ago

      Actually, she meant that if you have a big RDD, you should try filtering it before doing the costlier join operation (the basic Spark join principle). You can also split a dataframe into many dataframes, e.g. filter the big dataframe date-wise and loop over the pieces, which is safer and faster. If there is no way to filter the dataframe, then I guess there is nothing we can do... yet.
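
      A minimal sketch of filtering the big side before the costly join, assuming Spark SQL DataFrames and hypothetical paths and column names (event_date, user_id):

          import org.apache.spark.sql.SparkSession
          import org.apache.spark.sql.functions.col

          val spark = SparkSession.builder().appName("filter-before-join").getOrCreate()

          val bigDF   = spark.read.parquet("/data/events")  // large fact table (hypothetical path)
          val smallDF = spark.read.parquet("/data/users")   // smaller table (hypothetical path)

          // Filter the large side as early as possible so far fewer rows are shuffled by the join.
          val recent = bigDF.filter(col("event_date") >= "2016-01-01")
          val joined = recent.join(smallDF, Seq("user_id"))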

  • @sahebbhattu
    @sahebbhattu 8 years ago +1

    Thanks a lot! As you explained, if a huge table is joined to a small table, we can use broadcast. But what if I have 15 joins/tables one after another and 3 of them are huge, say the 1st, 5th and 11th? Going from the 1st to the 4th I will broadcast the 2nd, 3rd and 4th, but I can't broadcast the 5th (that would be a disaster, shipping the 5th to every node). So how do we handle this? Can we tune it in a better way?
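
    A minimal sketch of mixing the two join strategies, assuming hypothetical tables where t1 and t5 are huge, t2-t4 are small, and k2-k5 are the join keys:

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.broadcast

        val spark = SparkSession.builder().appName("broadcast-chain").getOrCreate()

        // Hypothetical inputs: t1 and t5 are huge, t2-t4 are small.
        val t1 = spark.read.parquet("/tables/t1")
        val t2 = spark.read.parquet("/tables/t2")
        val t3 = spark.read.parquet("/tables/t3")
        val t4 = spark.read.parquet("/tables/t4")
        val t5 = spark.read.parquet("/tables/t5")

        // Broadcast hints for the small tables: each executor gets a full copy,
        // so the huge side is never shuffled for these joins.
        val partial = t1
          .join(broadcast(t2), Seq("k2"))
          .join(broadcast(t3), Seq("k3"))
          .join(broadcast(t4), Seq("k4"))

        // Huge-to-huge join: do not broadcast; let Spark fall back to a shuffle
        // (sort-merge) join, ideally after filtering both sides as much as possible.
        val result = partial.join(t5, Seq("k5"))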

  • @mazenezzeddine8319
    @mazenezzeddine8319 9 years ago

    Thanks,
    Is the number of reduce workers/machines usually equal to the number of distinct keys? How are the worker machines selected out of the available set of workers? And what if I want the reduction to be done on a single machine, say the driver?
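
    For reference, a minimal RDD sketch (with made-up data): the number of reduce tasks is the number of output partitions you request, keys are assigned to those partitions by a hash partitioner rather than one machine per key, and actions like reduce() return the final result to the driver:

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().appName("reduce-sketch").getOrCreate()
        val sc = spark.sparkContext

        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

        // 4 reduce partitions regardless of how many distinct keys exist;
        // a hash partitioner decides which partition (and executor) gets each key.
        val counts = pairs.reduceByKey(_ + _, 4)

        // To finish a reduction on a single machine, use an action that returns
        // to the driver, e.g. reduce() (or collect() on an already-small result).
        val total = pairs.map(_._2).reduce(_ + _)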

  • @tesla-78661
    @tesla-78661 9 years ago

    Regarding the DIY approach to writing to a database: can't I just use df.write.mode(SaveMode.Append).jdbc(....)? Is using Spark's built-in JDBC connector not good practice? What is the best approach to write to a database from a Spark RDD?
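
    A minimal sketch of the built-in JDBC writer mentioned above, with a hypothetical PostgreSQL URL, table name, and credentials:

        import java.util.Properties
        import org.apache.spark.sql.{SaveMode, SparkSession}

        val spark = SparkSession.builder().appName("jdbc-write").getOrCreate()
        val df = spark.read.parquet("/data/output")   // data to persist (hypothetical path)

        val props = new Properties()
        props.setProperty("user", "spark_user")       // hypothetical credentials
        props.setProperty("password", "secret")
        props.setProperty("driver", "org.postgresql.Driver")

        // Each partition writes its rows over its own connection in batches,
        // which is essentially the "DIY" foreachPartition pattern done for you.
        df.write
          .mode(SaveMode.Append)
          .jdbc("jdbc:postgresql://db-host:5432/analytics", "results", props)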

  • @PatrickHulce
    @PatrickHulce 9 years ago

    I'm with +Igor Berman. The "solution" she described is exactly the problem. If I could already filter "all the world" down to just Californians, I wouldn't need the join in the first place... Does anyone know what she's talking about with a filter transform here?

    • @desalgelijk
      @desalgelijk 9 years ago

      +Patrick Hulce It is assumed that the two RDDs contain different sets of columns, otherwise there would be no point in joining.

    • @PatrickHulce
      @PatrickHulce 9 years ago

      I see. My assumption was that the California RDD contained only a subset of fields, like just the ID, and the goal was to filter the world RDD down to those records. It seems there's no shortcut for that, though.

    • @desalgelijk
      @desalgelijk 9 years ago

      +Patrick Hulce I watched that part again, and it does seem like she filters the world RDD by looking at the IDs in the CA RDD. Not sure how that would work without a shuffle...

    • @PatrickHulce
      @PatrickHulce 9 years ago

      +desalgelijk that makes three of us now then :) haha it just doesn't seem possible on the surface.

    • @koudelkaa
      @koudelkaa 9 years ago

      +Patrick Hulce I found a blog post here: fdahms.com/2015/10/04/writing-efficient-spark-jobs/ discussing the whys and hows of this trick. The idea is that the key space of the medium-sized RDD fits in memory and can be broadcast. In my case the medium-sized RDD does not fit in memory, and my left outer join has the large RDD on the left, so a filter is useless anyway. I wish Spark had a built-in KV store to support lookup-based joins. I know it is possible with an external KV store, but it is annoying and adds maintenance.
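
      A minimal sketch of the trick being discussed, assuming pair RDDs keyed by id (sample data made up) and that the medium RDD's key set fits in memory:

          import org.apache.spark.sql.SparkSession

          val spark = SparkSession.builder().appName("filter-by-broadcast-keys").getOrCreate()
          val sc = spark.sparkContext

          // Hypothetical data: worldRdd is huge, caRdd is the medium-sized subset.
          val worldRdd = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
          val caRdd    = sc.parallelize(Seq((1L, "CA"), (3L, "CA")))

          // Collect only the keys of the medium RDD and broadcast them: a small set
          // shipped to every executor, no shuffle involved yet.
          val caKeys = sc.broadcast(caRdd.keys.collect().toSet)

          // Filter the big RDD partition-locally before the join, so far fewer records
          // take part in the shuffle the join itself still performs.
          val filteredWorld = worldRdd.filter { case (id, _) => caKeys.value.contains(id) }
          val joined = filteredWorld.join(caRdd)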

  • @deeplearningpartnership
    @deeplearningpartnership 4 years ago

    Interesting.

  • @thevijayraj34
    @thevijayraj34 3 years ago

    That is "Everyday I'm Hustling" I guess😂

  • @tusharsharma9012
    @tusharsharma9012 6 years ago

    Shuffling Shuffling