Spark Job, Stages, Tasks | Lec-11

MANISH KUMAR

Views: 35,397

Published on 3 Oct 2024
In this video I explain how jobs, stages and tasks are created in Spark. If you want to optimize your Spark processes, you need a solid understanding of these concepts.
Directly connect with me on:- topmate.io/man...
For more queries, reach out to me on the social media handles below.
Follow me on LinkedIn:- / manish-kumar-373b86176
Follow Me On Instagram:- / competitive_gyan1
Follow me on Facebook:- / manish12340
My Second Channel -- / @competitivegyan1
Interview series Playlist:- • Interview Questions an...
My Gear:-
Rode Mic:-- amzn.to/3RekC7a
Boya M1 Mic-- amzn.to/3uW0nnn
Wireless Mic:-- amzn.to/3TqLRhE
Tripod1 -- amzn.to/4avjyF4
Tripod2:-- amzn.to/46Y3QPu
camera1:-- amzn.to/3GIQlsE
camera2:-- amzn.to/46X190P
Pentab (Medium size):-- amzn.to/3RgMszQ (Recommended)
Pentab (Small size):-- amzn.to/3RpmIS0
Mobile:-- amzn.to/47Y8oa4 (You should definitely not buy this one)
Laptop -- amzn.to/3Ns5Okj
Mouse+keyboard combo -- amzn.to/3Ro6GYl
21 inch Monitor-- amzn.to/3TvCE7E
27 inch Monitor-- amzn.to/47QzXlA
iPad Pencil:-- amzn.to/4aiJxiG
iPad 9th Generation:-- amzn.to/470I11X
Boom Arm/Swing Arm:-- amzn.to/48eH2we
My PC Components:-
intel i7 Processor:-- amzn.to/47Svdfe
G.Skill RAM:-- amzn.to/47VFffI
Samsung SSD:-- amzn.to/3uVSE8W
WD blue HDD:-- amzn.to/47Y91QY
RTX 3060Ti Graphic card:- amzn.to/3tdLDjn
Gigabyte Motherboard:-- amzn.to/3RFUTGl
O11 Dynamic Cabinet:-- amzn.to/4avkgSK
Liquid cooler:-- amzn.to/472S8mS
Antec Prizm FAN:-- amzn.to/48ey4Pj

Comments • 170

@manish_kumar_1 1 year ago ⁺¹
Directly connect with me on:- topmate.io/manish_kumar25
@rimilog27 8 days ago
When the last action was hit, a job was created for collect. That job would create a stage too, so why wasn't it included?
@roshankumargupta46 4 months ago ⁺⁶
If you don't explicitly provide a schema, Spark will read a portion of the data to infer it. This triggers a job to read data for schema inference. If you disable schema inference and provide your own schema, you can avoid the job triggered by schema inference.
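A minimal PySpark sketch of the point above, assuming an existing `SparkSession` is passed in as `spark`. The file path and the column names in the schema are hypothetical, since the video's dataset is not shown on this page. Supplying a schema up front means Spark has no reason to scan the data to work out column types.

```python
# Hypothetical schema: column names and types are assumptions, not the video's data.
EMPLOYEE_SCHEMA = "id INT, name STRING, age INT, salary DOUBLE"


def read_with_explicit_schema(spark, path):
    """Read a CSV with a user-supplied schema, so no inference pass is needed."""
    return (
        spark.read
        .schema(EMPLOYEE_SCHEMA)       # DDL string; a StructType works here too
        .option("header", "true")      # first line holds the column names
        .csv(path)
    )


def read_with_inferred_schema(spark, path):
    """Same read, but inferSchema makes Spark scan data up front (an extra job in the UI)."""
    return (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(path)
    )
```

Running each variant in a fresh notebook cell and comparing the Jobs tab of the Spark UI is one way to check how many jobs the read alone triggers in your environment.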
@arpanscreations6954 2 months ago
Thanks for your clarification
@fury00713 9 months ago ⁺¹⁰
In Apache Spark, the spark.read.csv() method is neither a transformation nor an action. It is a method for initiating the reading of CSV data into a Spark DataFrame, and it is part of the data loading phase in Spark's processing model. The actual reading and processing of the data occur later, driven by Spark's lazy evaluation model.
@ChetanSharma-oy4ge 7 months ago ⁺¹
Then how are the jobs being created for actions? I mean, if actions equal jobs, what is a better way to find out?
@roshankumargupta46 4 months ago ⁺²
If you don't explicitly provide a schema, Spark will read a portion of the data to infer it. This triggers a job to read data for schema inference.
@shorakhutte1887 10 months ago ⁺⁶
Bro, that was a next-level explanation... Thanks for sharing your great knowledge. Keep up the good work. Thanks
@KhaderAliAfghan 11 months ago ⁺⁵
1 job for read,
1 job for print 1
1 job for print 2
1 job for count
1 job for collect
Total 5 jobs by my count, but I have not run the code, so I'm not sure
@piyushzope10 20 days ago
Yes, I also got 5 jobs. Not sure how Manish got only 2
@satyammeena-bu7kp 3 months ago
Really awesome explanation! You won't find an explanation like this anywhere else. Thank you so much
@samg3434 17 days ago
Truly splendid explanation
@nishasreedharan6175 5 months ago
One of the best videos ever. Thank you for this. Really helpful.
@Food_panda-hu6sj 1 year ago ⁺²
One question:
After groupBy, 200 partitions are created by default, where each partition holds the data for an individual key.
What happens if there are fewer keys, say 100? Will only 100 partitions be created instead of 200?
AND
What happens if there are more than 200 individual keys? Will it create more than 200 partitions?
@ShrinchaRani 1 year ago ⁺¹
Explained so well, and bit by bit too 👏🏻
@naturehealingandpeace2658 9 months ago
Wow, what a clear explanation. For the first time I understood it in one go
@ChetanSharma-oy4ge 7 months ago ⁺¹
I had a question: in what order should we write things? I mean, if we are doing filter/select/partition/group by/distinct/count or anything else, what should we write first…
@yashkumarjha5733 1 month ago
Bro, if you want to write it in an optimized way, first apply the filter and then apply the transformations.
For example: suppose I have data for 100 employees, I only want the employees with a salary greater than 90000, and I have to promote all of them, meaning raise all of their salaries. If you apply the transformation first (the salary increase) and then filter, the data for all 100 employees is scanned. But if you apply the filter first, and suppose only 2 people have a salary above 90000, then we only need to scan those 2 employees' data.
Sorry, the example was a bit poor, but hopefully the concept is clear.
By the way, even if you apply the filter after the transformation, it is not a problem, because Spark is internally designed to run in an optimized way. Once your whole operation is defined and you trigger the job, it'll first do the filter and then apply the other transformations on top of it.
Spark is very intelligent and will do it in an optimized way.
I hope this answers your question
@stuti5700 1 year ago ⁺²
Very good content. Please make detailed videos on Spark job optimization
@SuresgjmJ 7 months ago ⁺⁴
I have one doubt: there are 3 actions (read, collect and count), so why does it create only 2 jobs?
@tejathunder 5 months ago ⁺¹
In Apache Spark, the read operation is not considered an action; it is a transformation.
@mrinalraj4801 5 months ago
Great, Manish. I am grateful to you for making such rare content with so much depth. You are doing a tremendous job by contributing to the community. Please keep up the good work and stay motivated. We are always here to support you.
@Rajeshkumbhkar-x6v 2 months ago
Amazing explanation, sir ji
@rohitbhawle8658 1 year ago ⁺¹
Nicely explained. You make each and every concept clear. Keep it up
@Useracwqbrazy 1 year ago
I really liked this video... nobody has explained it at this level
@rahuljain8001 1 year ago
Didn't find such a detailed explanation anywhere else. Kudos
@asif50786 1 year ago ⁺¹
Start a playlist with guided projects, so that we can apply these things in real life.
@utkarshkumar6703 7 months ago
Brother, you explain really well
@mantukumar-qn9pv 1 year ago
Thanks, Manish bhai... Please keep making videos
@akhiladevangamath1277 4 months ago
Thank you so much Manish
@nikhilhimanshu9758 8 months ago
What would happen if there were another narrow transformation after filter, like filter --> flatMap --> select? How many tasks would that create?
@arshadmohammed1090 5 months ago
Great job bro, you are doing well.
@saumyasingh9620 1 year ago
This was so beautifully explained.
@AnandVerma 6 months ago ⁺¹
Num Jobs = 2
Num Stages = 4 (job1 = 1, job2 = 3)
Num Tasks = 204 (job1 = 1, job2 = 203)
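The arithmetic behind these totals can be sketched with a small plain-Python model (this is not a Spark API). It assumes one task per partition per stage, a single-partition read, the default of 200 shuffle partitions with adaptive query execution off, and a pipeline of read -> repartition(2) -> filter -> select -> groupBy -> collect. The `repartition(2)` value is an assumption, chosen because it reproduces the 203 figure quoted in the comments.

```python
def tasks_per_stage(initial_partitions, steps, shuffle_partitions=200):
    """Return the task count of each stage for a linear pipeline.

    Each step is a tuple:
      ("narrow",)         stays in the current stage
      ("repartition", n)  shuffle; the next stage has n partitions
      ("shuffle",)        shuffle such as groupBy; the next stage has shuffle_partitions
    One task runs per partition per stage.
    """
    stages = [initial_partitions]
    for step in steps:
        kind = step[0]
        if kind == "repartition":
            stages.append(step[1])
        elif kind == "shuffle":
            stages.append(shuffle_partitions)
        elif kind != "narrow":
            raise ValueError(f"unknown step: {kind}")
    return stages


# Job 2: repartition(2) -> filter -> select -> groupBy, starting from 1 partition
job2 = tasks_per_stage(1, [("repartition", 2), ("narrow",), ("narrow",), ("shuffle",)])
job1_tasks = 1  # the read job: 1 stage, 1 task

print(job2)                    # [1, 2, 200]
print(sum(job2))               # 203
print(job1_tasks + sum(job2))  # 204
```

Under these assumptions, 203 is the task count of job 2 alone and 204 is the total across both jobs.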
@shreyaspurankar9736 2 months ago
great explanation :)
@ADESHKUMAR-yz2el 5 months ago
Bhaiya, you are great
@kyransingh8209 9 months ago
@manish_kumar_1 : correction - job2 - stage2 is till group by, and job2 - stage 3 is till collect
@tanushreenagar3116 8 months ago
VERY VERY HELPFUL
@maruthil5179 1 year ago
Very well explained bhai.
@saumyasingh9620 1 year ago
When I ran this snippet of code in a notebook, it gave 5 jobs as shown below, not just 2. Can you explain?
Job 80 View(Stages: 1/1)
Stage 95: 1/1
Job 81 View(Stages: 1/1)
Stage 96: 1/1
Job 82 View(Stages: 1/1)
Stage 97: 1/1
Job 83 View(Stages: 1/1, 1 skipped)
Stage 98: 0/1 skipped
Stage 99: 2/2
Job 84 View(Stages: 1/1, 2 skipped)
Stage 100: 0/1 skipped
Stage 101: 0/2 skipped
Stage 102: 1/1
@KapilKumar-hk9xk 2 months ago
Excellent explanation. One question: with groupBy, 200 tasks are created, but most of these tasks are useless, right? How can we avoid such scenarios? Scheduling tasks for those empty partitions costs Spark extra effort, right?
@rushikeshsalunkhe8892 1 month ago ⁺¹
You can repartition it to a smaller number of partitions, or you can tweak the spark.sql.shuffle.partitions config by setting it to the desired number.
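As a rough sketch of the config route mentioned above, assuming an existing `SparkSession` is passed in as `spark`: the key is `spark.sql.shuffle.partitions` (with a trailing "s"), and the helper name and the value 8 are only illustrative.

```python
def set_shuffle_partitions(spark, n):
    """Set how many partitions (and therefore tasks) a shuffle such as groupBy produces."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    spark.conf.set("spark.sql.shuffle.partitions", str(n))
    # Read the value back so the caller can confirm what the session now uses.
    return spark.conf.get("spark.sql.shuffle.partitions")
```

Calling `set_shuffle_partitions(spark, 8)` before the `groupBy` should give 8 tasks in the post-shuffle stage instead of 200. With adaptive query execution enabled, Spark may coalesce small shuffle partitions further, so the number seen in the UI can be lower still.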
@Rafian1924 1 year ago ⁺¹
Your channel will grow immensely bro. Keep it up❤
@vinitsunita 1 year ago
Very good explanation bro
@AnkitNub 4 months ago
I have a question: if one job has 2 consecutive wide-dependency transformations, then 1 narrow-dependency transformation, and then 1 more wide-dependency transformation, how many stages will be created? Suppose repartition, then groupBy, then filter and then join: how many stages will this create?
@jaisahota4062 3 months ago
same question
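One way to reason about this question is a simplified plain-Python model (not a Spark API) that cuts a linear chain of transformations at every wide transformation. The `WIDE` and `NARROW` sets below are an assumed, partial classification; the real physical plan depends on the optimizer and the join strategy.

```python
# Simplified classification; real plans depend on the optimizer and join strategy.
WIDE = {"repartition", "groupBy", "join", "distinct", "orderBy"}
NARROW = {"filter", "select", "flatMap", "withColumn"}


def split_into_stages(ops):
    """Cut a linear chain of transformations into stages at every wide transformation."""
    stages = [["read"]]                 # the first stage starts at the data source
    for op in ops:
        if op in WIDE:
            stages.append([op])         # shuffle boundary: downstream work starts a new stage
        elif op in NARROW:
            stages[-1].append(op)       # pipelined into the current stage
        else:
            raise ValueError(f"unclassified operation: {op}")
    return stages


stages = split_into_stages(["repartition", "groupBy", "filter", "join"])
for number, stage in enumerate(stages, start=1):
    print(number, stage)
print(len(stages))  # 4
```

In this model the chain gives 4 stages on this side of the join, with the filter pipelined into the stage that follows the groupBy shuffle. The other input of the join would add its own stage or stages, and a broadcast join would avoid the final shuffle, so the count in the Spark UI can differ.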
@salmansayyad4522 5 months ago
excellent!!
@rimilog27 20 days ago
You showed 4 stages in the video, so how come the Spark UI shows only 3? One stage was created at the read itself
@sugandharaghav2609 1 year ago
I remember the wide dependency you explained in the shuffling video.
@sudipmukherjee6878 1 year ago
Bhai, I was eagerly waiting for your videos
@lalitgarg8965 5 months ago
count is also an action, so would there be 3 jobs?
@deepaksharma-xr1ih 2 months ago
good job man
@rajkumardubey5486 2 months ago
Count is also an action, right? That means 3 jobs were created
@jilsonjoe6259 1 year ago
Great Explanation ❤
@mayankkandpal1565 10 months ago
nice explanation.
@GauravKumar-im5zx 1 year ago
glad to see you brother
@prashantsrivastava9026 1 year ago
Nicely explained
@samirdeshmukh9886 1 year ago
thank you sir..
@shekharraghuvanshi2267 1 year ago ⁺¹
Hi Manish,
In the second job there were 203 tasks and in the first job there was 1, so are there 204 tasks in total in the complete application? I am a bit confused between 203 and 204.
Kindly clarify.
@parthagrawal5516 6 months ago
After repartition(3),
the default 200 partitions will still show up in the DAG, sir
@CctnsHelpdesk 6 months ago
7:40
print is an action, so there should be 4 jobs in the given code, right? Correct me if I am wrong
@kumarankit4479 4 months ago
Hi bhaiya. Why haven't we considered collect() as a job creator in the program you discussed?
@VenkataJaswanthParla 4 months ago
Hi Manish,
count() is also an action, right? If not, can you please explain what count() is?
@jatinyadav6158 8 months ago ⁺¹
Hi @manish_kumar_1,
I have one question. For the wide transformation, you said that in the groupBy stage (stage 3) there will be 200 tasks, matching the 200 partitions. But can you tell me why these 200 partitions are created in the first place?
@kumarabhijeet8968 1 year ago
Manish bhai, how many videos will there be in total in the theory and practical series?
@sharma-vasundhara 7 months ago ⁺¹
Sir, in line 14 we have .groupby and .count
.count is an action, right? Not sure if you missed it by mistake or if it doesn't count as an action? 🙁
@tanmayagarwal3481 6 months ago
I had the same doubt. Did you get the answer to this question? The UI also shows only 2 jobs, whereas count should be an action :(
@codingwithanonymous890 11 months ago
Sir, please make playlists for other data engineering tools too
@sairamguptha9988 1 year ago
Glad to see you, Manish... Bhai, any update on the project details?
@mohitgupta7341 1 year ago
Bro amazing explanation
@deeksha6514 6 months ago
Can an executor hold 2 partitions, or does having 2 partitions mean they will be on two different machines?
@gujjuesports3798 10 months ago
.count on a grouped dataset is a transformation, not an action, correct? If it were employee_df.count(), then it would be an action
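The distinction raised here can be sketched in PySpark as follows, assuming `df` is an existing DataFrame; the helper names and the `"age"` column used in the usage note are hypothetical. `groupBy(...).count()` is called on grouped data and returns another DataFrame, so it stays lazy, while `DataFrame.count()` returns a number to the driver and therefore has to run a job.

```python
def rows_per_key(df, key):
    """count() on grouped data: a transformation. Returns a new DataFrame; nothing runs yet."""
    return df.groupBy(key).count()


def total_rows(df):
    """count() on a DataFrame: an action. Triggers a job and returns a Python int."""
    return df.count()
```

For example, `rows_per_key(employee_df, "age")` only extends the plan, and a job appears once something like `.collect()` or `.show()` is called on the result, whereas `total_rows(employee_df)` shows up as a job immediately.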
@lucky_raiser 1 year ago
Bro, how many more days will the Spark series take, and will you build a complete DE project with Spark at the end?
BTW, I have watched and implemented all your theory and practical videos. Great sharing❤
@jay_rana 6 months ago
What if spark.sql.shuffle.partitions is set to some value? In that case, what will the number of tasks be in and after the groupBy stage?
@wgood6397 6 months ago
Please enable subtitles, bro
@mr.random2001 9 months ago
@manish_kumar_1 In the previous videos you said count() is an action, but in this video you are not treating it as an action. WHY??
@serenitytime1959 4 months ago
Could you please upload the required files? I just want to run it and see for myself.
@ramyabhogaraju2416 1 year ago
I have been waiting for your video. How many more days will the Spark series take to complete?
@rohanchoudhary672 4 months ago
df.rdd.getNumPartitions(): Output = 1
df.repartition(5)
df.rdd.getNumPartitions(): Output = 1
Using Databricks Community Edition, sir
@manish_kumar_1 4 months ago ⁺¹
Yes, I don't see any issue. If you don't assign your repartitioned df to a variable, you will get the same result
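The reply above comes down to DataFrames being immutable: `repartition` returns a new DataFrame and leaves the original untouched. A minimal sketch, assuming `df` is an existing DataFrame (the helper name is hypothetical):

```python
def partition_counts(df, n):
    """Return (ignored, kept): partition counts when the repartition result is dropped vs. kept."""
    df.repartition(n)                    # result discarded: df itself is unchanged
    ignored = df.rdd.getNumPartitions()
    df = df.repartition(n)               # rebind the name to the new DataFrame
    kept = df.rdd.getNumPartitions()
    return ignored, kept
```

For a DataFrame that starts with 1 partition, `partition_counts(df, 5)` should return `(1, 5)`, which matches the output reported in the question.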
@codetechguru1 7 months ago
Why is there no stage 4? You say each job has at least one stage and one task, so why isn't job 3 included in the stage and task counts?
@dpkbit08 7 months ago
I tried the same example and in my case 4 jobs are created. Is there any other config needed?
@wgood6397 6 months ago
Please enable subtitles on all videos, bro
@moyeenshaikh4378 11 months ago
Bhai, job 2 starts from collect, right? So from after the read up to collect, only job 1 runs, right?
@pranay-q1e 1 year ago
Can I know about executors?
How many executors will there be in a worker node?
And
does the number of executors depend on the number of cores in the worker node?
@gauravkothe9558 10 months ago
Hi sir, I have one doubt: will collect create a task and a stage or not? You mentioned 203 tasks
@aditya9c 6 months ago
Wow
@PrashantMishra-r8s 10 months ago
One query, @Manish: spark.read is a transformation and not an action, right?
@sachindramishra2813 7 months ago
df=spark.read.parquet()
print(df.rdd.getNumPartitions())
df=df.repartition(2)
print(df.rdd.getNumPartitions())
df=df.groupby("AccountId").count()
df.collect()
Why does this code create 5 Spark jobs in Databricks?
I've also used only 2 actions
@divyabhansali2182 1 year ago
Are there 200 tasks by default in groupBy even if there are only 3 distinct ages? If so, what will be in the remaining 197 tasks (which age groups)?
@sravankumar1767 6 months ago
Nice explanation 👌 👍 👏 but can you please explain in English? Then everyone all over the world can watch 🌎 ✨
@sharmadtadkodkar3731 6 months ago
What command did you use to run the job?
@pankajrathod5906 9 months ago
Where do I get the data for the file? Where have you provided the data?
@venugopal-nc3nz 1 year ago ⁺¹
Hi Manish, in Databricks too, when groupBy() is invoked, does it create 200 tasks by default? How can I reduce the 200 tasks when using groupBy() to optimize a Spark job?
@manish_kumar_1 1 year ago
There is a configuration that can be set. Just google how to set a fixed number of partitions after a join
@princekumar-li6cm 5 months ago
Count is also an action.
@manish_kumar_1 5 months ago
And transformation too
@devjyotipattnaik8588 2 months ago
Great explanation, Manish!!
As per my understanding from the video, a total of 2 jobs, 4 stages and 204 tasks will be created
job 1 - Read - which consists of 1 stage and 1 task
from job 1 -> until job 2 gets created we have 3 stages and 203 tasks
stage 2 - Repartition (wide dependency transformation) - 1 task is created
stage 3 - select and filter (narrow dependency transformations) - 2 tasks are created -> 1 for each transformation
stage 4 - group by (wide dependency transformation) - 200 tasks are created
Please correct me if I am wrong
@narag9802 9 months ago
Do you have an English version of the videos?
@garvitgarg6749 11 months ago ⁺¹
Bhai, I'm using Spark 3.4.1, and when I group data using groupBy (I have 15 records in a dummy dataset) it creates 4 jobs to process 200 partitions. Why? Is this a recent enhancement? I observed the same thing not only in the latest version but also in Spark 3.2.1. Could you please explain this?
@villagers_01 3 months ago
Bhai, a job is created on an action, but the number of actions != the number of jobs, because a job is created when a new RDD is needed. In groupBy we need to shuffle the data, and since RDDs are immutable, a new RDD has to be created after the shuffle. So whenever a new RDD needs to be created, a job is created. In your case: 1 job for the read, 1 job for the schema, 1 job for the shuffle and 1 job for the display.
@RahulAgarwal-uh3pf 4 months ago
In the code snippet shown at the start, count is also a job, right?
@manish_kumar_1 4 months ago
No. You will find out why in the upcoming lectures
@AnkitaSakseria 5 months ago
count is also an action, so why was no job created for it??
@manish_kumar_1 5 months ago
Count is both an action and a transformation. You will find out in the upcoming lectures
@riteshrajsingh7437 4 months ago
Hi Manish, I have a doubt: in groupBy, count is also an action, so why is it not counted as an action?
@manish_kumar_1 4 months ago
It will become clear in a few upcoming videos
@Watson22j 1 year ago
No problem, bhaiya, just please finish this whole series, because nobody on YouTube has explained it in this much detail.
@prabhatsingh7391 1 year ago
Hi Manish, count is also an action, and you have written count just after group by in the code snippet. Why is count not considered a job here?
@manish_kumar_1 1 year ago
You will find the explanation in upcoming videos. Count is both an action and a transformation. When it is which is explained in detail in later videos
@ankitachauhan6084 4 months ago
Why was count() not counted as an action when counting the jobs?
@manish_kumar_1 4 months ago
You will understand it in an upcoming lecture
@ranvijaymehta 1 year ago
Thanks Sir
@lifeisfun9 11 months ago ⁺¹
count() is also an action, isn't it, sir?
@manish_kumar_1 11 months ago
You will find out in the upcoming lectures. It is both an action and a transformation.
@kunalk3830 11 months ago
Bhai, for me it shows 3 jobs, 3 stages and 4 tasks. Job 0 for load with 1 task --> Job 1 for collect with 2 tasks --> Job 3 for collect with 1 task, but it is showing as skipped. I don't get what's wrong. I used the same code but different data, 7 MB in size.
@manish_kumar_1 11 months ago ⁺¹
No need to get too rigid here. Spark does a lot of optimization, and jobs often get skipped. I did this in a controlled environment to show how it works. During project development you are not going to count how many jobs, stages or tasks there are. So even if you don't get the same numbers, just chill
@mahnoorkhalid6496 1 year ago
I have executed the same job, but it created 4 jobs, each with 1 stage and 1 task.
I think it created a new job for every wide transformation. Please confirm and guide.
@villagers_01 3 months ago
Bhai, a job is created on an action, but the number of actions != the number of jobs, because a job is created when a new RDD is needed. In groupBy we need to shuffle the data, and since RDDs are immutable, a new RDD has to be created after the shuffle. So whenever a new RDD needs to be created, a job is created.
@shubhambhosale8467 8 months ago
203 tasks or 204 tasks? Please clarify this
@MiliChoksi-gc8if 5 months ago
So whenever there is a group by, do we have to assume 200 tasks?
@manish_kumar_1 5 months ago
Yes, if AQE is disabled. If it is enabled, then the count depends on data volume and default parallelism
@sivamani7711 5 months ago
Hi Manish, can you make the same content in English?
@manish_kumar_1 5 months ago ⁺¹
I am planning to shoot the whole Spark series in English too, but I have not yet finalized when I will start
@sivamani7711 5 months ago
@manish_kumar_1 thanks
