Apache Spark Vs Apache Flink - Looking Through How Different Companies Approach Spark And Flink

  • Published Jan 26, 2025

Comments • 24

  • @danhorus
    @danhorus 7 months ago +3

    13:03 in Spark, we avoid Python UDFs like the plague because they're much slower than native Spark code. I wonder if the same is true for Flink, given that it also runs on JVMs. A quick Google search indicates that vectorized UDFs are a thing in Flink too, so I assume the same limitations apply

    • @SeattleDataGuy
      @SeattleDataGuy 7 months ago +1

      Thanks for the added context! It's much appreciated. I'm now wondering if I've ever had a good experience with a UDF 🤣. I always remember touting them, but even in the one case I recall trying them out, on SQL Server, we found them slow.

    • @danhorus
      @danhorus 7 months ago +1

      @@SeattleDataGuy With Spark, there are several ways to write transformations. By far the best option is to use native Spark functions, as they compile to highly optimized and parallelized Java bytecode. The second-best option is to write UDFs in Scala or Java, as everything still runs in the same JVM. The third-best option, in case you want or need to use Python, is to write a vectorized UDF (also known as a Pandas UDF), which leverages Apache Arrow to move data between the JVM and the Python interpreter in batches. Finally, as a last resort, you can use regular Python UDFs; however, they're a lot slower because they basically compute results row by row rather than in big batches. If you have slow Spark jobs using Python UDFs, refactoring them is usually a good way to gain some performance. As for this blog post, I'm not sure the author is aware of this limitation, but if they need this code to run very fast, they should probably avoid Python UDFs too

    • @danhorus
      @danhorus 7 months ago +1

      @@SeattleDataGuy I wrote a long comment about the different types of UDFs in Spark, but apparently YouTube decided to delete it. Maybe you'll find it marked as spam, lol

    • @SeattleDataGuy
      @SeattleDataGuy 7 months ago

      @@danhorus Did you put a URL in it? That seems to be the main reason I've seen YouTube mark things as spam. I'll look

    • @danhorus
      @danhorus 7 months ago +4

      Not really, but let's try again, haha. In Spark, there are many ways to apply data transformations. By far the best option is to use native Spark functions, as they compile to highly optimized, parallelized Java bytecode. The second-best option for performance is to use Scala or Java UDFs, as they run inside the JVM with only a minor performance hit. The third option, if you want or need to use Python, is to write a vectorized UDF (also known as a Pandas UDF), which leverages Apache Arrow to transfer big batches of records to the Python interpreter and back to the JVM after processing. Finally, the last option you should consider is the regular Python UDF, as it basically transforms data row by row and has much worse performance as a result. If you have a slow Spark job, refactoring Python UDFs can make it a lot faster. I'm not sure the authors of the blog post are aware of this, but they could probably make their code faster too
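
Not Spark itself, but the row-by-row vs. batched trade-off described in the thread above can be sketched with plain pandas (illustrative only; the column name and function are made up, and exact timings will vary by machine):

```python
import time

import numpy as np
import pandas as pd

# Toy data: 300k rows of random floats
df = pd.DataFrame({"x": np.random.rand(300_000)})

def plus_tax(v):
    # Invoked once per row, like a regular Python UDF
    return v * 1.07

t0 = time.perf_counter()
row_by_row = df["x"].apply(plus_tax)   # one Python call per row
row_time = time.perf_counter() - t0

t0 = time.perf_counter()
vectorized = df["x"] * 1.07            # one call on the whole batch
vec_time = time.perf_counter() - t0

print(f"row-by-row: {row_time:.4f}s, vectorized: {vec_time:.4f}s")
```

The same principle is what makes Pandas UDFs faster than plain Python UDFs in Spark: Arrow hands the Python side whole batches, so the per-row call overhead largely disappears.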

  • @osoucy
    @osoucy 7 months ago +6

    To me, one of the main benefits of Spark Structured Streaming is that you can easily switch between near real-time (micro-batch) and scheduled batch processing without rewriting a single line of code. This is a very effective way of scaling up and down and balancing cost against latency.

    • @SeattleDataGuy
      @SeattleDataGuy 7 months ago +1

      That is very useful! When do you think micro-batches make the most sense?
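
A sketch of what that switch looks like in Spark Structured Streaming (illustrative only; the Kafka broker, topic, and checkpoint path are placeholders, and this needs a running Spark cluster with the Kafka connector):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The transformation logic is written once, regardless of trigger mode
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "events")                     # placeholder topic
    .load())

counts = events.groupBy("key").count()

writer = (counts.writeStream
    .outputMode("complete")
    .option("checkpointLocation", "/tmp/checkpoints/events")  # placeholder
    .format("console"))

# Near real-time: fire a micro-batch every 30 seconds
query = writer.trigger(processingTime="30 seconds").start()

# Scheduled batch: process everything available, then stop --
# same pipeline, only the trigger changes (Spark 3.3+)
# query = writer.trigger(availableNow=True).start()
```

Because only the `trigger(...)` call differs, the same job can run as a long-lived low-latency stream or as a cheap cron-scheduled batch.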

  • @jace743
    @jace743 7 months ago +5

    I’d watch if you did live article reviews!

    • @SeattleDataGuy
      @SeattleDataGuy 7 months ago +2

      Yeah! From watching other creators do it, I think I'd really gotta slow down to do it well

  • @thedailyepochs338
    @thedailyepochs338 6 months ago

    Love the video! I have a question though: do you have to have a good understanding of Java to implement this kind of thing in production? It seems like Kafka and the Java client libraries go hand in hand. What are your thoughts on this?

  • @DataPains
    @DataPains 7 months ago +1

    Great video! Thank you for sharing!

  • @richardmartin6605
    @richardmartin6605 7 months ago +2

    Would love to see article reviews!

    • @SeattleDataGuy
      @SeattleDataGuy 7 months ago

      Awesome! Any particular articles?

  • @damien__j
    @damien__j 7 months ago +1

    Great video thanks!

  • @knkootbaoat6759
    @knkootbaoat6759 7 months ago +7

    Gotta make things complex, otherwise we wouldn't get paid as much. I half joke: we don't make it complex, it's just that situations are inherently complex

    • @SeattleDataGuy
      @SeattleDataGuy 7 months ago +3

      We do tend to do that sometimes...