Data engineer interview question | Process 100 GB of data in Spark | Number of Executors
- Published on Sep 5, 2024
- In this video, we have discussed how to process 100 GB of data in Spark. This is one of the most frequently asked questions in interviews for data engineering roles.
Directly connect with me on:- topmate.io/man...
For more queries reach out to me on my below social media handle.
My Second Channel -- / @competitivegyan1
Interview series Playlist:- • Interview Questions an...
Follow me on LinkedIn:- / manish-kumar-373b86176
Follow Me On Instagram:- / competitive_gyan1
Follow me on Facebook:- / manish12340
My Gear:-
Rode Mic:-- amzn.to/3RekC7a
Boya M1 Mic-- amzn.to/3uW0nnn
Wireless Mic:-- amzn.to/3TqLRhE
Tripod1 -- amzn.to/4avjyF4
Tripod2:-- amzn.to/46Y3QPu
camera1:-- amzn.to/3GIQlsE
camera2:-- amzn.to/46X190P
Pentab (Medium size):-- amzn.to/3RgMszQ (Recommended)
Pentab (Small size):-- amzn.to/3RpmIS0
Mobile:-- amzn.to/47Y8oa4 (You really should not buy this one)
Laptop -- amzn.to/3Ns5Okj
Mouse+keyboard combo -- amzn.to/3Ro6GYl
21 inch Monitor-- amzn.to/3TvCE7E
27 inch Monitor-- amzn.to/47QzXlA
iPad Pencil:-- amzn.to/4aiJxiG
iPad 9th Generation:-- amzn.to/470I11X
Boom Arm/Swing Arm:-- amzn.to/48eH2we
My PC Components:-
intel i7 Processor:-- amzn.to/47Svdfe
G.Skill RAM:-- amzn.to/47VFffI
Samsung SSD:-- amzn.to/3uVSE8W
WD blue HDD:-- amzn.to/47Y91QY
RTX 3060Ti Graphic card:- amzn.to/3tdLDjn
Gigabyte Motherboard:-- amzn.to/3RFUTGl
O11 Dynamic Cabinet:-- amzn.to/4avkgSK
Liquid cooler:-- amzn.to/472S8mS
Antec Prizm FAN:-- amzn.to/48ey4Pj
You are actually filling the gap.. much thanks man..!!
Request you to kindly make more interactive videos like this, especially on the topics below -
1. Repartition with a real-time scenario. How to determine the repartition size depending on data size and cluster size
2. Key salting method - a practical/real-time case with a coding example
3. Data serialization in Spark and how it helps with optimization
4. Choosing a file type for different scenarios (Parquet/JSON/ORC)
5. DAG analysis
6. Accumulator - with real time use cases
7. Cache and persist - when to use what
8. garbage collection tuning
9. Real time coding issues faced by data engineers and debugging
10. Version control system for databricks notebook
11. Real-time production implementation of big data projects
12. How to perform unit testing for databricks notebooks?
Thanks in advance.. ❤❤
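For topic 1 in the list above, here is a minimal back-of-envelope sketch of how repartition size is often reasoned about (this is a common rule of thumb, not a formula from the video): target roughly 128 MB per partition, then round up to a multiple of the cluster's total cores so no core idles in the last wave.

```python
import math

def suggest_num_partitions(data_size_gb, total_cores, target_partition_mb=128):
    # Aim for ~128 MB partitions, then round up to a multiple of the
    # available cores so the final wave keeps every core busy.
    raw = math.ceil(data_size_gb * 1024 / target_partition_mb)
    return math.ceil(raw / total_cores) * total_cores

# 100 GB on a 20-core cluster: 102400 MB / 128 MB = 800 partitions,
# already a multiple of 20.
print(suggest_num_partitions(100, 20))  # 800
```

You would then pass this number to `df.repartition(n)`; the 128 MB target mirrors Spark's default `spark.sql.files.maxPartitionBytes`.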
Thank you Manish Bhai, you understand what matters to aspiring data engineers and what they need to know in depth. Really appreciate this.
Mihir is just bluffing and telling generic stories. Manish did a good job by interrupting him. Keep it up.
I don't think so, he is answering it correctly.
Directly connect with me on:- topmate.io/manish_kumar25
Best Channel for data engineer 👍👍
Thanks Manish for this informative session. I already had this question in my mind and had been searching for it for a few days... finally today you made this video. It's like magic... thanks a lot, man. Please make more videos on such questions that are asked in interviews.
Sure
This channel's reach is going to take off very fast; it will rise quickly. Mark my words.
Thank you so much for lovely comments
Excellent
Hi @MANISH KUMAR
As per Mihir's first approach >> 4:03
he is considering 5 executors with 2 cores each and 10 GB memory per executor. In this case,
5*2 = 10 cores in total (10 parallel processes) and
10 GB * 5 = 50 GB in total memory.
I think 5 executors with the above configuration will not hold 100 GB of data at once; they can only hold 50 GB.
Correct me if I am wrong.
The calculation mentioned at the end >> 10:53
(5 to 6 executors with 4 cores each and 15 GB RAM per executor) seems fine.
You don't need to load the entire data into memory in one go. Suppose you had only 30 GB of memory left in the cluster - do you think the job would fail?
Actually it won't; it will just take more time to process. There is a trade-off between memory utilization and time: if you load the entire data in one go it takes less time, but if you use less memory it takes more time.
The 10 GB of each executor is divided into several parts: approximately 300 MB is reserved memory, and the remaining memory is divided in a 40:60 ratio.
40 percent is used as user memory, to store user-defined variables and data.
60 percent is used as Spark memory, and it is divided in a 50:50 ratio into storage memory and execution memory.
Roughly calculated, about 3 GB will be used as execution memory, divided between the 2 cores.
So each core can handle about 1.5 GB of data. However, we will actually be handling partitions of 128 MB.
We have 10 cores in total, so 10 partitions can be processed in parallel.
There are 800 partitions in total, so 80 waves are needed to process our 100 GB of data.
If one wave takes 10 seconds, the total time would be 10*80 seconds.
This is a rough calculation, and the actual numbers will vary with resource availability.
Hope it's helpful 🙏
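The breakdown above can be sketched as a few lines of arithmetic. One nuance worth noting: in Spark's actual unified memory model, the 60% fraction (`spark.memory.fraction`) applies to the heap *minus* the 300 MB reserved memory, not to the full 10 GB, so the numbers below come out slightly under the round figures in the comment.

```python
def executor_memory_breakdown(executor_mem_gb, cores_per_executor):
    # Spark reserves 300 MB, then spark.memory.fraction (default 0.6)
    # of the remainder becomes unified Spark memory, which is split
    # 50:50 (spark.memory.storageFraction) into storage and execution.
    usable = executor_mem_gb - 300 / 1024
    spark_mem = usable * 0.6
    execution = spark_mem * 0.5
    return execution, execution / cores_per_executor

execution, per_core = executor_memory_breakdown(10, 2)
# execution ~2.91 GB, per_core ~1.46 GB (the comment rounds to 3 GB / 1.5 GB)

partitions = (100 * 1024) // 128  # 800 partitions of 128 MB
waves = partitions // 10          # 10 cores in total -> 80 waves
```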
I learnt this from Sumit Mittal
@@fit1801 Before even coming to the end of your comment, I guessed that you have enrolled in Sumit Sir's course ;) Nice explanation bro, I too had same kind of explanation for this scenario
@@fit1801 is Sumit Mittal's course good? I have 8+ years of experience in non-tech and want to transition to DE. Will the course help me make the transition?
Thanks in advance. 😊
The interviewer asked me about processing PETABYTES of data. Can you explain how to deal with that scenario?
For that, many more things have to be considered. First of all, what is their cluster size? How much time it takes, and how we proceed, depends on that.
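As the reply says, the real answer depends on the cluster, but the same 128 MB-partition arithmetic from the 100 GB case scales up. A back-of-envelope sketch, where the core count and per-wave time are assumptions purely for illustration:

```python
import math

# 1 PB split into 128 MB partitions.
partitions = (1024 * 1024 * 1024) // 128      # 1 PB in MB / 128 -> 8,388,608

total_cores = 2000                            # assumed cluster-wide cores
waves = math.ceil(partitions / total_cores)   # 4195 waves
hours = waves * 10 / 3600                     # assumed 10 s/wave -> ~11.7 h
```

The point of the exercise is that at petabyte scale the number of waves, shuffle volume, and cluster size dominate the answer, not the per-executor memory math.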
thank you sir
Memory is not calculated by guesswork the way he is doing in the interview. There is a proper formula to calculate the number of executors, cores, and memory.
Can you please write that formula here so that everyone can benefit?
This is not the correct approach, I believe. To process 100 GB of data, 800 blocks would be created. We would need more executors running in parallel; with only the resources described, it will take much longer than expected.
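The "proper formula" asked about above is usually the widely taught YARN sizing heuristic (a convention from Spark tuning guides, not something stated in the video): leave 1 core and 1 GB per node for the OS, use about 5 cores per executor for good HDFS throughput, reserve 1 executor for the ApplicationMaster, and hold back roughly 7% of executor memory as off-heap overhead.

```python
def size_executors(nodes, cores_per_node, mem_per_node_gb):
    cores_per_executor = 5                              # good HDFS throughput
    usable_cores = (cores_per_node - 1) * nodes         # 1 core/node for OS
    executors = usable_cores // cores_per_executor - 1  # 1 executor for the AM
    executors_per_node = (cores_per_node - 1) // cores_per_executor
    mem_per_executor = (mem_per_node_gb - 1) / executors_per_node * 0.93
    return executors, cores_per_executor, round(mem_per_executor, 1)

# Classic worked example: 6 nodes x 16 cores x 64 GB
print(size_executors(6, 16, 64))  # (17, 5, 19.5)
```

These would map to `--num-executors 17 --executor-cores 5 --executor-memory 19G` in `spark-submit`; it is a starting point to tune from, not a hard rule.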
How much data structures knowledge is needed for a data engineer, and how should one learn it? Please make a video on this topic...
Sure
Great video... Can you make a video on what projects a fresher should build for a Data Engineer role?
Brother check data engineering zoomcamp.
@@jparmar1 are u talking about YT channel DataTalksClub , its available there or some other resource ?
@@rishav144 Yes the same. They have a github repository, so follow the steps as shown there and the tutorials are on youtube.
Awesome
Thank you ❤
Thank you very much Manish for your guidance, it is really helpful; I am your new subscriber. My query: I am a good Python developer and know intermediate SQL, but I am very new to Spark - I have only learnt the basics. Can you suggest a course where I can learn real-time questions like processing 100 GB of data in Spark? Is there any resource on Udemy or elsewhere? Thanks in advance; I want to change careers from Python developer to data engineer.
Learn the hard way. Don't look for shortcuts. Download a dataset from Kaggle and work on that data by yourself. Transform it and write it back to HDFS or a cloud storage bucket.
@@manish_kumar_1 thanks for the suggestions, surely I will follow them.
I don't think he answered the question correctly, and he was confident only until you started raising doubts. I think this answer won't work in interviews, because he kept steering his answers toward resources, business, and so on - but that was not what was asked.
@Manish bhai please, would like to know your approach to this question with calculations
Brother, are there remote jobs in the data engineering field?
Can someone post the content in English too?
I am unable to understand this question.
Can one become a Data Engineer after one year of study, sir...?
Absolutely
He is answering roughly and just bluffing.
Instead of giving a solution, the guy is just rambling.
Explain in English
Where to learn these in depth spark architecture... Any resources/book you'll suggest ?
Spark the definitive guide book
Brother, what's your Instagram or Gmail?
It's in the video description.