13:03 in Spark, we avoid Python UDFs like the plague because they're much slower than native Spark code. I wonder if the same is true for Flink, given that it also runs on JVMs. A quick Google search indicates that vectorized UDFs are a thing in Flink too, so I assume the same limitations apply
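For reference, a minimal sketch of what the two flavors look like in PyFlink's Table API (assuming PyFlink 1.11+; the function names here are just illustrative, not from the video or blog post):

```python
import pandas as pd
from pyflink.table import DataTypes
from pyflink.table.udf import udf

# Regular Python UDF: crosses the JVM <-> Python boundary one row at a time
@udf(result_type=DataTypes.BIGINT())
def add_one(x):
    return x + 1

# Vectorized (Pandas) UDF: func_type="pandas" makes Flink ship Arrow batches,
# so the function receives and returns whole pandas Series
@udf(result_type=DataTypes.BIGINT(), func_type="pandas")
def add_one_vectorized(x: pd.Series) -> pd.Series:
    return x + 1

# Either can be registered on a TableEnvironment, e.g.:
# t_env.create_temporary_function("add_one_vectorized", add_one_vectorized)
```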
Thanks for the added context! It's much appreciated. Now I'm wondering if I've ever had a good experience with a UDF 🤣. I always remember touting them, but even in the one case where I do recall trying them out, on SQL Server, we found them slow.
@@SeattleDataGuy With Spark, there are several ways to write transformations. By far, the best option is to use native Spark functions, as they compile to highly optimized and parallelized Java bytecode. The second best option is to write UDFs in Scala or Java, as everything still runs in the same JVM. The third best option, in case you want/need to use Python, is to write a vectorized UDF (also known as a Pandas UDF), which leverages Apache Arrow to move data between the JVM and the Python interpreter in batches. Finally, as a last resort, you can use regular Python UDFs; however, they're a lot slower because they basically compute results row by row rather than in big batches. If you have slow Spark jobs using Python UDFs, refactoring them is usually a good way to gain some performance. About this blog post, I'm not sure the author is aware of this limitation, but if they need this code to run very, very fast, they should probably avoid Python UDFs too.
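To make the first and last options concrete, here's a rough PySpark sketch (the toy DataFrame and column names are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumn("name", F.lit("some text"))

# Best option: a native Spark function, optimized by Catalyst and executed in the JVM
native = df.withColumn("name_upper", F.upper("name"))

# Last resort: a regular Python UDF, which serializes every row out to a Python worker and back
to_upper = udf(lambda s: s.upper() if s is not None else None, StringType())
with_python_udf = df.withColumn("name_upper", to_upper("name"))
```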
@@SeattleDataGuy I wrote a long comment about the different types of UDFs in Spark, but apparently YouTube decided to delete it. Maybe you'll find it marked as spam, lol
@@danhorus Did you put a URL in it? That seems to be the main reason I've seen YouTube flag things as spam. I'll look
Not really, but let's try again, haha. In Spark, there are many ways to apply data transformations. By far the best option is to use native Spark functions, as they compile to highly optimized/parallelized Java bytecode. The second best option to maximize performance is to use Scala or Java UDFs, as they run inside the JVM with a minor performance hit. The third option, if you want/need to use Python, is to write a vectorized UDF (also known as a Pandas UDF), which leverages Apache Arrow to transfer big batches of records to the Python interpreter and back to the JVM after processing. Finally, the last option you should consider is the regular Python UDF, as it basically transforms row by row and has much worse performance as a result. If you have a slow Spark job, refactoring Python UDFs can make it a lot faster. I'm not sure the authors of the blog post are aware of this, but they can probably make their code faster too.
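For illustration, a minimal PySpark sketch of the vectorized option next to its row-at-a-time equivalent, assuming Spark 3.x (the column name is hypothetical):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "x")

# Regular Python UDF: computes one row at a time across the JVM/Python boundary
@udf(DoubleType())
def plus_one_slow(x):
    return float(x) + 1.0

# Vectorized (Pandas) UDF: Apache Arrow batches arrive as pandas Series
@pandas_udf(DoubleType())
def plus_one_fast(x: pd.Series) -> pd.Series:
    return x + 1.0

df.select(plus_one_slow("x"), plus_one_fast("x")).show(5)
```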
To me, one of the main benefits of Spark Structured Streaming is that you can easily switch between near real-time (micro-batch) and scheduled batch processing without rewriting a single line of code. This is a very effective way of scaling up and down and balancing cost vs. latency.
That is very useful! When do you think micro-batches make the most sense?
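As a rough sketch of the switch described above, the only thing that changes between near real-time and scheduled batch is the trigger (the Kafka topic, broker address, and paths below are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source: a Kafka topic named "events"
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
)

writer = (
    events.writeStream
    .format("parquet")
    .option("path", "/data/events")
    .option("checkpointLocation", "/checkpoints/events")
)

# Near real-time: micro-batches every 30 seconds
query = writer.trigger(processingTime="30 seconds").start()

# Scheduled batch: same pipeline, different trigger; process whatever is available, then stop
# (availableNow requires Spark 3.3+; older versions can use trigger(once=True))
# query = writer.trigger(availableNow=True).start()
```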
I’d watch if you did live article reviews!
Yeah! Watching other creators do it, I think I really gotta slow down to do it well.
Love the video! I have a question though: do you have to have a good understanding of Java to implement Kafka in production? It seems like Kafka and Java client libraries go hand in hand. What are your thoughts on this?
Great video! Thank you for sharing!
Thanks for watching!
Would love to see article reviews!
Awesome! Any particular articles?
Great video thanks!
Glad you liked it!
Gotta make things complex, otherwise we wouldn't get paid as much. I half joke. We don't make it complex; it's just that situations are inherently complex.
We do tend to do that sometimes...