Thanks for this Zach! Kudos for sticking to your promise of releasing this one big consolidated video on Saturday.
Wish it was a little longer but had to pull some of the content since it wasn't quality enough to release
Just finished watching the week 2 video and doing the labs! Thank you for the content, Zach! It's so informative and helpful, even with my limited experience in Postgres, window functions, and CTEs.
Thanks for the content! I applied some of these concepts in my job and it actually made a difference. Excited to get to the Spark section of the bootcamp!
solid 4 hrs Zach!!! love it!
Thanks Zach! Your effort is clear in the content you produce.
In my dreams I can already hear Zach's voice talking about data. I am fascinated by this intense and amazing course.
Thank you, this makes intuitive sense: make all data as compressed as possible using built-in techniques. I am a fan of the logical AND gate, pretty neat!
Reporting in from Australia. Also commenting "for the algorithm", so it knows this should be higher in the trending chart.
☝
Thank you Zach. I've carefully watched and listened to this course. It was amazing. It's also inspiring to see how much you can give to the community. I really hope that the business you are generating is rewarding too
Your punctuality is awesome. Thank you, teacher
It was a lot to get this out today but I'm glad we pulled it off!
It's great, Zach! I just learned something new from you.
I never imagined my digital logic design knowledge would help me understand this lecture🙂.
This is the best, Zach. All materials are consolidated in one video👍
Thank you, I am "slowly changing" my catching up status!
Blessed to find this. Thanks, Zach, for leveling me up.
Very helpful. This video course should be a top pick for learning data engineering. Thanks Zach!
All the way from Nigeria, God bless you sir.😂😂😂😂
It's really interesting and useful.
Thank you, Zach!
Keep up the good work!
Data is the gold of the 21st century
Come on Zach. Listen to the masses. We need a Black Friday discount for dataexpert subscription pls!!
Coming soon
Hey Zach, thanks for another great video! When do you use the dim prefix for column names in your tables, and when don't you use it? I didn't find that convention clear from the video.
Thank you man! One more subscriber from Brazil!
Great job Zach !!! Keen to binge ❤❤
This is amazing content. Thank you!!
Thanks for the great insights. I just wonder what the best practices are for maintaining those large-volume tables. The table is supposed to be updated, with more data coming in regularly. Do you compact your tables regularly to avoid fragmentation, given that compaction is one of the most costly operations? Do you have to adjust the Spark config frequently as the table size increases (scaling up the cluster, increasing the max bytes per partition, etc.)?
Besides that, I am really curious about data governance at a corporation of Meta's scale. I bet they have to have access control and data protection (column masking, row filtering, etc.). More importantly, for the data lifecycle, data subject rights management (a user requesting access to their specific data, or requesting that all the data collected about them be deleted, etc.) would be one of the most complicated topics to fold into data modeling at that scale.
Daily partitioning fam
@@EcZachly_ That is what I expected. But it comes with its own performance problems when regular queries filter on other columns.
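A minimal PySpark sketch of a daily-partitioned write, with a sort within partitions added so Parquet min/max stats can still prune on a non-partition column; the table and column names are assumptions, not anything from the video:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical events source (names are made up for illustration).
events = spark.table("raw.events")

(
    events
    # Clustering a frequently filtered column inside each file lets Parquet
    # row-group min/max stats skip data even though the table is only
    # partitioned by ds.
    .sortWithinPartitions("user_id")
    .write
    .mode("overwrite")
    .partitionBy("ds")          # daily partitioning: one folder per day
    .saveAsTable("analytics.events_daily")
)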
Thanks for this awesome knowledge!
50:45 For hourly deduping with microbatches, according to what you said, it only dedupes within the day; sadly, it doesn't compare against and dedupe with yesterday's data then? Deduping within the day only still seems less than ideal for certain snapshots where we only take the latest or earliest row per user_id, for example.
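A hedged sketch of the kind of within-a-day dedupe being described, keeping the earliest row per user_id with a window function (table and column names are assumptions); deduping across days would additionally require comparing against yesterday's snapshot:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical events for one day.
events = spark.table("raw.events").where(F.col("ds") == "2023-11-01")

# Rank rows per user within the day and keep only the earliest one.
w = Window.partitionBy("user_id", "ds").orderBy(F.col("event_time").asc())

deduped = (
    events
    .withColumn("rn", F.row_number().over(w))
    .where(F.col("rn") == 1)
    .drop("rn")
)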
At 40:00, is bucketing the same as partitioning?
Nope. Partitioning is putting things in a folder. Bucketing is making new files within a folder
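A small PySpark sketch of the difference, assuming a made-up events table (this is not code from the video):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
events = spark.table("raw.events")   # hypothetical source table

# Partitioning: one directory per ds value.
events.write.mode("overwrite").partitionBy("ds").saveAsTable("demo.events_partitioned")

# Bucketing: within each ds folder, rows are hash-split by user_id into a
# fixed number of files, which is what lets joins on user_id avoid a shuffle.
(
    events.write.mode("overwrite")
    .partitionBy("ds")
    .bucketBy(16, "user_id")
    .sortBy("user_id")
    .saveAsTable("demo.events_bucketed")
)

The bucket count of 16 is arbitrary here; both sides of a join need the same bucket column and count for the shuffle to be skipped.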
Hey Zach, if you add a 0:00 timestamp, the chapters will appear 👍🏻
Thanks for the tip!
Thanks for the great course Zach. In the Day 3 lecture, for the Facebook long-period analysis example you gave, did the data engineers come up with that kind of analysis, or was it the data scientists, with the engineers doing the implementation and optimisation?
How's this free? Fantastic course when I was trying to refresh on fact data modelling. Thank you.
It’s coming down on January 31st
Why should we name columns starting with "dim" in the fact tables?
@EcZachly_, isn't that CROSS JOIN you are doing to make the date list of bits an expensive operation, which bloats the dataset by 31 times? Could you please elaborate on that?
CROSS JOINs are only expensive when you match every row to every other row. In this case it does that, but the date side is only ~31 rows, and the GROUP BY collapses the result right back down.
@@EcZachly_ Thanks for the quick reply. Doesn't that mean that, for example, a 1 billion record table will end up with 31 billion rows? Is that a concern, or can we neglect it compared to the size of the data?
@@ThilinaKariyawasam the original activity data would be 31 billion rows right?
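A rough PySpark sketch of the pattern under discussion (the table, column names, and date range are made up, not Zach's exact lab code): the date side of the CROSS JOIN has at most 31 rows, and the GROUP BY immediately collapses the blown-up intermediate into one integer per user.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical cumulated table: one row per user with the dates they were active.
users = spark.createDataFrame(
    [(1, ["2023-01-01", "2023-01-03"]), (2, ["2023-01-31"])],
    ["user_id", "dates_active"],
)

# The small side of the CROSS JOIN: one row per day of the month (31 rows).
days = spark.sql(
    "SELECT explode(sequence(to_date('2023-01-01'), to_date('2023-01-31'))) AS d"
)

# Up to 31x blow-up per user, then collapsed straight back down by the GROUP BY.
datelist = (
    users.crossJoin(days)
    .withColumn(
        "bit",
        F.when(
            F.expr("array_contains(dates_active, CAST(d AS STRING))"),
            F.pow(F.lit(2), F.datediff(F.to_date(F.lit("2023-01-31")), F.col("d"))),
        ).otherwise(F.lit(0)),
    )
    .groupBy("user_id")
    .agg(F.sum("bit").cast("long").alias("datelist_int"))
)
datelist.show()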
Can someone explain to me the difference between a fact table and an ordinary table in an OLTP database? As far as I understand, both contain facts or events, but a fact table is built for analytical purposes while a transactional DB is for transactional data. So maybe a fact table is built from the transactional DB with some ETL processes? Sorry, I just got confused.
Fact tables are optimized for “whole table analysis”
OLTP transactions are meant for “single user” analysis.
Think about the WHERE clause here
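To make the WHERE-clause point concrete, a tiny sketch of the two query shapes (table names are hypothetical; the OLTP one would really be served by an indexed Postgres/MySQL table rather than Spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# OLTP shape: a point lookup for one entity, ideally hitting an index.
one_user_orders = spark.sql("SELECT * FROM app.orders WHERE user_id = 12345")

# Fact-table shape: scan the whole table (or whole partitions) and aggregate
# across all users -- no highly selective WHERE on a single entity.
daily_revenue = spark.sql("""
    SELECT ds, SUM(order_amount) AS revenue
    FROM fct_orders
    GROUP BY ds
""")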
Why was a broadcast join no longer possible with IPv6? Did the size of the to-be-broadcast dataset change as well? I don't get why the change only inhibits the broadcast. Thanks for the explanation!
The IPv4 search space is DRAMATICALLY smaller than IPv6's.
There are about 4 billion IPv4 addresses. There are 340,282,366,920,938,463,463,374,607,431,768,211,456 IPv6 addresses.
You can compress the IPv4 address space into a trie data structure. You cannot do the same for IPv6.
I think changing from IPv4 to IPv6 would increase the length of the data type, and therefore the dataset is bigger?
@@domelorinczy2674 yes. IPv6 search space is much bigger
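For reference, a minimal sketch of what the broadcast side of this looks like in PySpark (table and column names are assumptions):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical tables: a huge event log and a much smaller IP lookup table.
events = spark.table("raw.network_events")
ip_info = spark.table("ref.ipv4_metadata")

# broadcast() ships the entire small side to every executor, so the big side
# never shuffles. That only works while the lookup side fits in memory --
# plausible for the IPv4 space, hopeless at IPv6 scale.
joined = events.join(F.broadcast(ip_info), on="ip_address", how="left")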
This is gold
Thank you zach ❤
quality content
@zach - Would you mind breaking down the lengthy video into multiple small videos? I think it would make it easier to watch and help increase your viewership as well.
No
There are timestamps at the bottom. Longer videos are better
Off topic, but I have heard quantum computing will replace these languages. So should I start with quantum computing or with your training?
Who told you that?
@@EcZachly_ Saw news somewhere!!
Is this the fourth playlist video, or is this the first week 2 video, with the rest of the videos in the playlist being from week one?
Correct. This is week 2 all at once. The rest are week 1.
@EcZachly If I complete the homework after 1 month, will I get the certificate, or do I need to complete it by Dec 31st?
You need to complete it by January 31st, not December 31st.
Hi Zach, I am from India and I have a question: after completing the bootcamp and creating a project, how do I apply for a summer 2025 internship? Whom should I contact, because I can't find internships anywhere, even on LinkedIn? How should I start?
Ask in discord plz
brilliant
What do you think about Elon not sharing Twitter data? Retrieving up to 1M posts per month costs $5,000/month.
thanks
thank u
NWT- Not with the Team.
DNP- Did not play
DND- Did not dress
Learnt something useful that I didn't know.
Spark does not support writing data to a Hive bucketed table because it uses a different hash function. Additionally, it does not provide any major performance benefits when joining two Hive bucketed tables on the bucket key. How did you solve this problem?
Meta allows Spark to use Hive's hash function. You can override these things.
Yes it does provide tons of benefits. No shuffle is a huge deal. It has to be configured properly though
@@EcZachly_ I agree, but what I meant was that open-source Spark 3.3 currently does not utilise the advantages of Hive bucketed tables and instead treats them as regular Hive tables during read or join operations.
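A hedged sketch of the no-shuffle case using Spark's own (non-Hive-compatible) bucketing, assuming both sides were written with the same bucketBy spec (the table names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Bucketed reads must be enabled (this is the default in recent versions).
spark.conf.set("spark.sql.sources.bucketing.enabled", "true")

# Both sides are Spark-managed tables written with bucketBy(16, "user_id").
left = spark.table("demo.events_bucketed")
right = spark.table("demo.users_bucketed")

joined = left.join(right, on="user_id")

# When bucketing is picked up, the physical plan shows a SortMergeJoin with
# no Exchange (shuffle) on either side.
joined.explain()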
second comment from Kenya
W
first cmt
let's do this, folks