I built data pipelines at Netflix that ran 2000 TBs per day, here’s what I learned about huge data!
- Published Oct 1, 2024
- Check out my academy at www.DataExpert.io where you can learn all this in much more detail!
Use code ZACH15 to get 15% off!
#dataengineering
#netflix
I’m so glad I found this video, I was just sitting here with 60 million gigabytes and was figuring out what joins to use so this was perfect timing.
if all u registered was 60 mil gb & joins ur not flowing
You're kidding, but somehow I just started a data analysis project of two terabytes and this video shows up.
@@aripapas1098 if you think comments must indicate a user registered every aspect of a video, ur not following
@@aripapas1098 this is a sad comment
Sarcasm ???? 😂
Can't wait to build hyperscale pipelines for my startup with 0 users
But it sounds powerful when you say it, like you mean business.
Based
1 user (me)
If you build it, they will come.
I have 1k TB of data just sitting around in my backyard. Glad your video came up to get me started on at least something.
Boyfriend simulator: you sit with your bf and he starts talking about this nerdy stuff you have no idea about but need to keep listening because you love him
This is exactly correct
aww 🥰
After marriage they no longer pretend to listen
If only a girl would fall for me when I speak nerdy stuff 🫠
@@rajns8643 are you kidding me? This is what most people like the most! Intelligent people are extremely attractive
In the future a wrist watch will have a little blinking light that will have 60 million gigabytes of data in it
You mean an Electron app?
yeah okay crack smoker
And it will still lag and hit 99% singularities
@@dhillaz that will just show current time
@@Ivan-Bagrintsev Yes it will show the time, but with full DRM. Unless you have a license to view certain minutes it will be denied.
What I absolutely love about your videos is that as a beginner in the data engineering field, you often talk about things that I had no conception of. In this video for example, I have never heard of SMBs or broadcast joins. This gives me an opportunity to learn these things, even hearing them be mentioned by someone as widely experienced as you.
You need not necessarily have to even go into detail, but these short form videos act as beacons of knowledge that I can throw myself into learning about.
Thanks a lot, and keep these coming Zach!
Really appreciate this comment! It reminds me that the value I'm putting out there is important!
@@EcZachly_ ✌
Great summation! I was thinking the exact same thing while watching. It's nice hearing even the specialized lingo from technical experts in their fields, it piques my curiosity.
@@EcZachly_ thanks
@@EcZachly_ did you already know the importance of these two before Netflix or did you learn that while working at Netflix?
Thanks Zach, hopefully one day I will understand what all of that means
😂😂😂, I’m starting now
2 pita bites a day, the same as me when I’m on a diet.😊
Holy crap. I’m currently learning about data science, the various roles, etc. -with the hope of one day switching careers. But the current state of learning is all about the languages and software used etc, not about the infrastructure and what to do with massive datasets. So this just 🤯
It's really about math, but no one talks about it. Get at least one year of university-level math comprehension and then get into Python and the tech tools. The most competent and successful data engineers are always people with a good STEM background. For example, Zach has a Bachelor's Degree in Applied Mathematics and a Bachelor's Degree in Computer Science, so he is a heavy numbers guy. That's what most Data Science / Engineering YouTubers don't tell their viewers, because that would cause them to lose viewers.
learning the tools can be very different from solving real world problems.
@@samuelisaacs7557 True asf
@@samuelisaacs7557 Yep, even a business administration bachelor's will have a lot of maths, and it's nowhere near data science, which is 3x that.
Sir this is a Wendy's.
I love that you kept it short and to the point.
He sure wanted to save some data… 😅
Yes but why does he look like a French model
Do u actually like every comment 😮
Dude has beef with Bezos😂
Great content, an honour to be able to listen to someone who has handled that volume of data.
literally 🎉
Have ChatGPT or some other LLM explain it to you.
1. Are you a data engineer?
2. What tech is this? AWS, Snowflake?
Thanks for the info Zach. Could you please make an elaborate video on the SMB join?
In the 37 years I’ve been working in data, I’ve never heard anyone call it Peter 😂. PETA
What's wrong with a Peter bite?
Heya Peeda
Could be an accent or a slip 😂
Imma wait for Primeagen to confirm this as well when he reacts to this video inevitably 😁
tbh if youre dealing with this much data, it's likely a good problem to have 💰 💰
My problem is how do people even find out about the careers that they go into?
I don't know anything about data science? Why am I watching this?
You shouldn't write "s" in Terabyte per hour, just TB/hr
"TBs/hr" looks like "Terabyte*second / hour" 😅
Petabyte was misspelled. Great video though.
Damn I just wanted to shuffle like there’s no tomorrow and then I found this video.
If I shuffled all the words in this video, it would still sound the same to me.
Thanks Zach for the insightful video. I have a similar use-case. Hence, a few questions:
1. So, with such large volumes of data, do you archive it, compress it, or just set a TTL on it? What do you suggest would be the best way to handle this?
2. With such large datasets, when joining the two tables, bucketing along with partitioning would be the most viable option, right? Can you make a video around the joins if possible?
Thanks!
How in the hell did I get to this side of the algorithm? 😢 hahaha
I love pita bites as much as the next guy, but I don't think I can take more than 35 before I'm full
As a guy struggling to get a job because entry-level roles require experience, I have learned something new and valuable today: broadcast and SMB joins.
Did Facebook use Databricks or did they have HPC Clusters for you to run Spark on?
you could have made up all of this jargon and it wouldn’t change my perception on what you were talking about 😂
please do not take this as a negative comment, im just completely ignorant
I've never heard of these terms, thank you for sharing your real-case scenarios (the FB notification example)
Me watching this not knowing anything he's talking about makes me feel like starting a big tech company 😀
My medical science clients called, they need an 800tb imaging data set parsed by end of day (thank you kubernetes)
What the fuck is he talking about? How can you shuffle the pipe and bucket the joins. Who are the tables? Why is bucketing joins?
Wow, if I knew all this, it's pretty amazing content...
If only...
Can you help me get a job as Data Analyst? I have certifications but employers never hire me
That's a ridiculous amount of data, but wait till you see my girlfriend's Messenger 😂
I don't quite understand why Netflix needs data pipelines.
Wth is in that data? Like seriously, I feel like most of that would be redundant shit, since even a chemical plant can be run entirely on Excel without ever needing a DB involved
Bleh, who actually shuffles these days at these high volumes! Bucket joining ftw!
Peterbytes Bucket joining manage shuffle and fcuk Jeff bezos is what I got out of this
Thanks Zach, but I have a question: a broadcast join is used when we have a small dimension joined with a big table, is this your case? Or did you use a hash join with two large tables?
Subscribing just for the Britto. One of my favourite hoods
Hold my beer while I cross join Amazon to Netflix
100TB keyword and trying to sell courses at end of the day
If you ever want to collab let me know!
Wow I didn’t even know that such joins existed. No one taught me 😮
The way I only know very basic networking and have no idea what you’re talking about
No idea what this guy is talking about, but thankful YouTube sent me this
I'd like to learn more about these pitabytes. What are they? What do they taste like?
Thank you Zach for taking the time to give us the hard truth and hand down your experience. It helps a lot of enthusiastic students/people to know how we can in some way support or help others in the subjects we like. I don't imagine myself processing 2000 TB per day, but it helps give a bigger picture. Once again, appreciate the short video and thank you for sharing
Hey
Data with Zach
.. I have some questions.. So Netflix uses AWS servers all over the world.... I am wondering, how many GB is each 4K movie, each 1080p movie? :) And what audio mixes do they have.. Dolby Atmos, DTX etc. :) Have a good day.. love from Sweden :)
For serving videos they use OpenConnect and CloudFront, not AWS servers. This allows them to serve the video from the closest regional spot to you.
Almost all videos can be served in 4K, but are downsampled depending on the current network conditions
I’m trying to get into data analytics and most of this went over my head, but this still sounds lit 🔥
In a way that's not gonna make Jeff Bezos millions of dollars 😂😂
Now I just need a billion dollar company to have these kinda problems.
My question would be, why do you have a table that big? Can't you distribute or cluster your data?
I'm thinking like 10000 users per server. Only stuff around those 10k users gets stored.
No magic needed to query stuff.
Gotta analyze it all together though
You keep that $750k salary. I’d rather sleep better.
How do you get a job in this field? Were you in software engineering?
Wait, i have 200TB/hr what do I do? Please help!
Me a random viewer because yt algorithm said so: 🤔🧐
are you parsing customer facial expressions or something?
I wanna hear about "everything I need to know about extreme high volume" AFTER you compare notes with people from various other organisations processing that much or lots more, outside your expertise, i.e. LLNL, CERN, the chemical industry, or any large government, and scaled-down ones too, where you recognise your lessons.
Otherwise it's just an incomplete angle. Which would still be OK if you reflect on it. But telling everything people need to know? Nah. You can't, I can't, they can't.
It’s a two minute video bro, relax
I built data pipelines at Netflix that ran 2000000000 MBs per day
I know the words, but the sentences are not familiar 🤔
I would be interested in the architecture and content delivery, pre- and post-CDN, from a network design perspective. Are there any examples or presentations regarding networking at Netflix?
Minimizing retention and broadcast joins could have been ten seconds of the video, and the rest could have been productively spent explaining SMB joins with a diagram
Make that video and share it with me!
I really wonder how Netflix hits 100 TB/hr just from streaming videos.
But… every day I’m shufflin’ 😢
I have 60 million cross joints but not enough time to smoke em
Would you say that using bucketing, basically constraining against "acceptable" throughput as well as risking creating a gazillion files in the process, is a more acceptable approach than more ad hoc ones like z-ordering and Bloom filters?
If you come across a scenario where you need to join two large datasets, you could do an iterative broadcast join. Basically, you break one of the DataFrames into multiple smaller DataFrames and join each one in a loop until all of them have been joined.
You’ll require a lot of memory and have long start times, no?
Is this only available with sparksql?
No, broadcasts can be leveraged in any processing framework that uses two sets of processing logic: your highly parallelized logic, plus what is commonly a single process. The single process "broadcasts" data to all of the parallel instances. It can be implemented other ways, but that is the most common.
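For anyone curious what that actually looks like, here is a minimal plain-Python sketch of the iterative broadcast idea (no Spark required; the tables, keys, and chunk size are all made up for illustration): split the medium table into chunks small enough to "broadcast" as an in-memory dict, then probe the large table against each chunk in a loop and union the partial results.

```python
# Iterative "broadcast" join sketch: join a large table against a
# medium table one chunk at a time, so each chunk fits in memory.
# All data here is made up for illustration.

large = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]   # (key, value)
medium = [(1, "X"), (2, "Y"), (4, "Z"), (5, "W")]            # (key, attr)

CHUNK = 2  # pretend anything bigger won't fit in executor memory

def iterative_broadcast_join(large_rows, medium_rows, chunk_size):
    joined = []
    for i in range(0, len(medium_rows), chunk_size):
        # "Broadcast" one chunk: build a small lookup dict from it.
        lookup = dict(medium_rows[i:i + chunk_size])
        # Probe side: scan the large table against the broadcast chunk.
        joined.extend((k, v, lookup[k]) for k, v in large_rows if k in lookup)
    return joined

result = iterative_broadcast_join(large, medium, CHUNK)
# → [(1, 'a', 'X'), (2, 'b', 'Y'), (4, 'd', 'Z'), (5, 'e', 'W')]
```

In Spark itself the same idea would mean slicing the medium DataFrame, wrapping each slice in `broadcast()` inside the loop, and unioning the partial join results.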
He is channeling a young William Binney over here, isn't he
Has anyone told that you look a lot like Carlos Sainz Jr.?
Linus looks different somehow 😅
I liked the content, but Jeff Bezos didn't
Wow, didn't know Owen Wilson was working on data
Bro is the PewDiePie of data Engineering
Turn off shuffle service… don’t sort
Wait? So youre making a table? 🤔🥴
Very important concept in such short time.. thank u so very much ❤
What did you do to those massive data sets with your huge mouth?
I suddenly feel like pita bread...
I still bite my gigas when my man hustling meta in peta
Whoa, is the guy wearing a Romero Britto shirt?
This was so fucking interesting
Did you work with Theprimagen 😂
I always shuffle. Like always.
Can you get a tripod for your cam?
All that data flow just for selfies and self-indulgance. 99% crap, but perhaps that 1% pays off.
Perhaps...
Bro 100 TB an hour???? Yo whattt
I love how you acronym Sorted Bucket Merge as SMB. Think you may have had Super Mario Bros on the mind 😂
Small fry honestly.
Honestly, 2000 TB per day isn't the problem. The problem is the cost and how much of the load is bursty. If it is not bursty, it is pretty much always cheaper to do it in-house on your own hardware than to pay to rent the cloud.
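That cost point is easy to sanity-check with back-of-envelope arithmetic. Every number below is an assumption picked for illustration (real cloud and hardware prices vary wildly), but it shows the shape of the comparison for a steady, non-bursty load:

```python
# Back-of-envelope: steady 2,000 TB/day processing, cloud vs in-house.
# ALL prices are illustrative assumptions, not real quotes.

tb_per_day = 2000
days_per_year = 365

cloud_cost_per_tb = 5.0          # assumed all-in $/TB processed in the cloud
cloud_per_year = tb_per_day * days_per_year * cloud_cost_per_tb

cluster_capex = 4_000_000        # assumed hardware cost for an equivalent cluster
amortize_years = 4
opex_per_year = 1_500_000        # assumed power, space, and staff

inhouse_per_year = cluster_capex / amortize_years + opex_per_year

print(f"cloud:    ${cloud_per_year:,.0f}/yr")     # cloud:    $3,650,000/yr
print(f"in-house: ${inhouse_per_year:,.0f}/yr")   # in-house: $2,500,000/yr
```

Under these made-up numbers the in-house cluster wins because its cost is fixed while the cloud bill scales with every terabyte; a bursty load flips the comparison, since idle owned hardware still costs money.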
Never thought a broadcast join would be a Netflix saviour
Just started following you. Really appreciate you for sharing your knowledge with the community.
I love technology and I know more than your average user, yet I have no IT qualifications and I am light years away from this knowledge, but for some reason, I love watching these videos as if I was ever going to use the information 😂
What engine were you using to do these massive joins? Spark?
Yep!
Just download more ram
Love the way you tried to make it sound more complicated than it actually is and failed.
What’s a megabyte