Data engineer interview question | Process 100 GB of data in Spark | Number of Executors
- Published on Sep 5, 2024
- In this video, we have discussed how to process 100 GB of data in Spark. This is one of the most frequently asked questions in interviews for data engineering roles.
Directly connect with me on:- topmate.io/man...
For more queries reach out to me on my below social media handle.
My Second Channel -- / @competitivegyan1
Interview series Playlist:- • Interview Questions an...
Follow me on LinkedIn:- / manish-kumar-373b86176
Follow Me On Instagram:- / competitive_gyan1
Follow me on Facebook:- / manish12340
My Gear:-
Rode Mic:-- amzn.to/3RekC7a
Boya M1 Mic-- amzn.to/3uW0nnn
Wireless Mic:-- amzn.to/3TqLRhE
Tripod1 -- amzn.to/4avjyF4
Tripod2:-- amzn.to/46Y3QPu
camera1:-- amzn.to/3GIQlsE
camera2:-- amzn.to/46X190P
Pentab (Medium size):-- amzn.to/3RgMszQ (Recommended)
Pentab (Small size):-- amzn.to/3RpmIS0
Mobile:-- amzn.to/47Y8oa4 (You really should not buy this one)
Laptop -- amzn.to/3Ns5Okj
Mouse+keyboard combo -- amzn.to/3Ro6GYl
21 inch Monitor-- amzn.to/3TvCE7E
27 inch Monitor-- amzn.to/47QzXlA
iPad Pencil:-- amzn.to/4aiJxiG
iPad 9th Generation:-- amzn.to/470I11X
Boom Arm/Swing Arm:-- amzn.to/48eH2we
My PC Components:-
intel i7 Processor:-- amzn.to/47Svdfe
G.Skill RAM:-- amzn.to/47VFffI
Samsung SSD:-- amzn.to/3uVSE8W
WD blue HDD:-- amzn.to/47Y91QY
RTX 3060Ti Graphic card:- amzn.to/3tdLDjn
Gigabyte Motherboard:-- amzn.to/3RFUTGl
O11 Dynamic Cabinet:-- amzn.to/4avkgSK
Liquid cooler:-- amzn.to/472S8mS
Antec Prizm FAN:-- amzn.to/48ey4Pj
You are actually filling the gap.. much thanks man..!!
Request you to kindly make more interactive videos like this, especially on the topics below -
1. Repartition with a real-time scenario. How to determine the repartition size depending on data size and cluster size
2. Key salting method - a practical/real-time case with a coding example
3. Data serialization in Spark and how it helps with optimization
4. Choosing a file type for different scenarios (Parquet/JSON/ORC)
5. DAG analysis
6. Accumulator - with real time use cases
7. Cache and persist - when to use what
8. garbage collection tuning
9. Real time coding issues faced by data engineers and debugging
10. Version control system for databricks notebook
11. Real-time production implementation of big data projects
12. How to perform unit testing for databricks notebooks?
Thanks in advance.. ❤❤
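For topic 1 in the list above, here is a minimal back-of-envelope sketch of how repartition size is often reasoned about (this is a common rule of thumb, not a formula from the video): target roughly 128 MB per partition, then round up to a multiple of the cluster's total cores so no core idles in the last wave.

```python
import math

def suggest_num_partitions(data_size_gb, total_cores, target_partition_mb=128):
    # Aim for ~128 MB partitions, then round up to a multiple of the
    # available cores so the final wave keeps every core busy.
    raw = math.ceil(data_size_gb * 1024 / target_partition_mb)
    return math.ceil(raw / total_cores) * total_cores

# 100 GB on a 20-core cluster: 102400 MB / 128 MB = 800 partitions,
# already a multiple of 20.
print(suggest_num_partitions(100, 20))  # 800
```

You would then pass this number to `df.repartition(n)`; the 128 MB target mirrors Spark's default `spark.sql.files.maxPartitionBytes`.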
Thank you Manish Bhai, you understand what matters to aspiring data engineers and what they need to know in depth. Really appreciate this.
Mihir is just bluffing and telling generic stories. Manish did a good job by interrupting him. Keep it up.
I don't think so, he is answering it correctly.
Directly connect with me on:- topmate.io/manish_kumar25
Best Channel for data engineer 👍👍
Thanks Manish for this informative session. I already had this question in my mind and had been searching for it for a few days... finally today you made this video. It's like magic... thanks a lot, man. Please make more videos on such questions that are asked in interviews.
Sure
This channel's reach is going to take off very fast; it will rise quickly. Mark my words.
Thank you so much for lovely comments
Excellent
Hi @MANISH KUMAR
As per Mihir's first approach >> 4:03
he is considering 5 executors with 2 cores each and 10 GB memory per executor. In this case,
5*2 = 10 cores in total (10 parallel processes) and
10 GB * 5 = 50 GB in total memory.
I think 5 executors with the above configuration will not hold 100 GB of data at once; they can only hold 50 GB.
Correct me if I am wrong.
The calculation mentioned at the end >> 10:53
(5 to 6 executors with 4 cores each and 15 GB RAM per executor) seems fine.
You don't need to load the entire data into memory in one go. Suppose you had only 30 GB of memory left in the cluster - do you think the job would fail?
Actually it won't; it will just take more time to process. There is a trade-off between memory utilization and time: if you load the entire data in one go it takes less time, but if you use less memory it takes more time.
The 10 GB of each executor is divided into several parts: approximately 300 MB is reserved memory, and the remaining memory is divided in a 40:60 ratio.
40 percent is used as user memory, to store user-defined variables and data.
60 percent is used as Spark memory, and it is divided in a 50:50 ratio into storage memory and execution memory.
Roughly calculated, about 3 GB will be used as execution memory, divided between the 2 cores.
So each core can handle about 1.5 GB of data. However, we will actually be handling partitions of 128 MB.
We have 10 cores in total, so 10 partitions can be processed in parallel.
There are 800 partitions in total, so 80 waves are needed to process our 100 GB of data.
If one wave takes 10 seconds, the total time would be 10*80 seconds.
This is a rough calculation, and the actual numbers will vary with resource availability.
Hope it's helpful 🙏
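The breakdown above can be sketched as a few lines of arithmetic. One nuance worth noting: in Spark's actual unified memory model, the 60% fraction (`spark.memory.fraction`) applies to the heap *minus* the 300 MB reserved memory, not to the full 10 GB, so the numbers below come out slightly under the round figures in the comment.

```python
def executor_memory_breakdown(executor_mem_gb, cores_per_executor):
    # Spark reserves 300 MB, then spark.memory.fraction (default 0.6)
    # of the remainder becomes unified Spark memory, which is split
    # 50:50 (spark.memory.storageFraction) into storage and execution.
    usable = executor_mem_gb - 300 / 1024
    spark_mem = usable * 0.6
    execution = spark_mem * 0.5
    return execution, execution / cores_per_executor

execution, per_core = executor_memory_breakdown(10, 2)
# execution ~2.91 GB, per_core ~1.46 GB (the comment rounds to 3 GB / 1.5 GB)

partitions = (100 * 1024) // 128  # 800 partitions of 128 MB
waves = partitions // 10          # 10 cores in total -> 80 waves
```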
I learnt this from Sumit Mittal
@@fit1801 Before even coming to the end of your comment, I guessed that you have enrolled in Sumit Sir's course ;) Nice explanation bro, I too had same kind of explanation for this scenario
@@fit1801 is Sumit Mittal's course good? I have 8+ years of experience in non-tech and want to transition to DE. Will the course help me make the transition?
Thanks in advance. 😊
The interviewer asked me about processing PETABYTES of data. Can you explain how to deal with that scenario?
For that, many more things have to be considered. First of all, what is their cluster size? How much time it takes, and how we proceed, depends on that.
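As the reply says, the real answer depends on the cluster, but the same 128 MB-partition arithmetic from the 100 GB case scales up. A back-of-envelope sketch, where the core count and per-wave time are assumptions purely for illustration:

```python
import math

# 1 PB split into 128 MB partitions.
partitions = (1024 * 1024 * 1024) // 128      # 1 PB in MB / 128 -> 8,388,608

total_cores = 2000                            # assumed cluster-wide cores
waves = math.ceil(partitions / total_cores)   # 4195 waves
hours = waves * 10 / 3600                     # assumed 10 s/wave -> ~11.7 h
```

The point of the exercise is that at petabyte scale the number of waves, shuffle volume, and cluster size dominate the answer, not the per-executor memory math.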
thank you sir
Memory is not calculated by guesswork the way he is doing in the interview. There is a proper formula to calculate the number of executors, cores, and memory.
Can you please write that formula here so that everyone can benefit?
This is not the correct approach, I believe. To process 100 GB of data, 800 blocks would be created. We would need more executors running in parallel; with only the resources described, it will take much longer than expected.
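The "proper formula" asked about above is usually the widely taught YARN sizing heuristic (a convention from Spark tuning guides, not something stated in the video): leave 1 core and 1 GB per node for the OS, use about 5 cores per executor for good HDFS throughput, reserve 1 executor for the ApplicationMaster, and hold back roughly 7% of executor memory as off-heap overhead.

```python
def size_executors(nodes, cores_per_node, mem_per_node_gb):
    cores_per_executor = 5                              # good HDFS throughput
    usable_cores = (cores_per_node - 1) * nodes         # 1 core/node for OS
    executors = usable_cores // cores_per_executor - 1  # 1 executor for the AM
    executors_per_node = (cores_per_node - 1) // cores_per_executor
    mem_per_executor = (mem_per_node_gb - 1) / executors_per_node * 0.93
    return executors, cores_per_executor, round(mem_per_executor, 1)

# Classic worked example: 6 nodes x 16 cores x 64 GB
print(size_executors(6, 16, 64))  # (17, 5, 19.5)
```

These would map to `--num-executors 17 --executor-cores 5 --executor-memory 19G` in `spark-submit`; it is a starting point to tune from, not a hard rule.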
How much data structures knowledge is needed for a data engineer, and how should one learn it? Please make a video on this topic...
Sure
Great video... Can you make a video on what projects a fresher should build for a Data Engineer role?
Brother check data engineering zoomcamp.
@@jparmar1 are u talking about YT channel DataTalksClub , its available there or some other resource ?
@@rishav144 Yes the same. They have a github repository, so follow the steps as shown there and the tutorials are on youtube.
Awesome
Thank you ❤
Thank you very much Manish for your guidance, it is really helpful; I am your new subscriber. My query: I am a good Python developer and know intermediate SQL, but I am very new to Spark - I have only learnt the basics. Can you suggest a course where I can learn real-time questions like processing 100 GB of data in Spark? Is there any resource on Udemy or elsewhere? Thanks in advance; I want to change careers from Python developer to data engineer.
Learn the hard way. Don't look for shortcuts. Download a dataset from Kaggle and work on that data by yourself. Transform it and write it back to HDFS or a cloud storage bucket.
@@manish_kumar_1 thanks for the suggestions, surely I will follow them.
I don't think he answered the question correctly, and he was confident only until you started raising doubts. I think this answer won't work in interviews, because he kept steering his answers toward resources, business, and so on - but that was not what was asked.
@Manish bhai please, would like to know your approach to this question with calculations
Brother, are there remote jobs in the data engineering field?
Can someone post the content in English too?
I am unable to understand this question.
Can one become a Data Engineer after one year of study, sir...?
Absolutely
He is answering roughly and just bluffing.
Instead of giving a solution, the guy is just rambling.
Explain in English
Where to learn these in depth spark architecture... Any resources/book you'll suggest ?
Spark the definitive guide book
Brother, what's your Instagram or Gmail?
It's in the video description.