Big Data Hadoop Spark Cluster on AWS EMR Cloud | Big Data on AWS Cloud | Production Big Data Cluster

Sumit Mittal

53,292 views

Published on 28 Dec 2024

Comments • 64

@sumitmittal07 2 years ago
Check out the Big Data course details here: trendytech.in/?referrer=youtube_bd22
@gebrilyoussef6851 3 months ago
Sumit, you are the master trainer of Big Data. Thank you so much for all the efforts you made.
@anuragdubey5898 2 years ago ⁺²
Very informative session. I learnt a lot and cleared my doubts as well. The easy, simplified explanation makes this one of the best videos on using AWS for big data. Thanks for the session.
@NaturalPro100 4 years ago ⁺²
This really cleared up some basics I needed for starting Spark on AWS. The content and explanation are to the point. Thanks for sharing, Sumit.
@VallabhGhodkeB 2 years ago
Top stuff, this is. Just got started; way to go.
@sampaar 4 years ago ⁺¹
Amazing presentation. Better than many of the Udemy courses that I have come across.
@udaynayak4788 2 years ago
One of the most informative sessions. Thank you so much for sharing.
@mdabdulmujeebmalik422 4 years ago
Excellent video on AWS and how to run a Spark job on AWS. Amazing! Thank you so much for the video, and kudos to the instructor.
@RaviKumar-oy5jq 4 years ago ⁺²
Excellent session.
@laxmisuresh 3 years ago
Very meaningful presentation. Explained at the right pace and with proper content.
@kashamp9388 2 years ago ⁺¹
Best session ever. Concise.
@sumitmittal07 2 years ago
Glad you are liking my teaching :)
@datoalavista581 2 years ago
Thank you for sharing
@gauravrai4398 4 years ago ⁺¹
Very lucid and concise explanation. A job well done!
@sumitmittal07 4 years ago
Thank you Gaurav
@sohailhosseini2266 2 years ago ⁺¹
Thanks for the video!
@pankajnakil6173 3 years ago
Very useful and good explanation.
@NIHAL960 4 years ago ⁺⁵
S3: Amazon's storage service
On-demand instance: available on demand
Spot instance: available at a steep discount on a temporary basis; can be reclaimed with a 2-minute warning
Reserved instance: available at a discount compared to on-demand in exchange for a long commitment, such as a year
Types of nodes:
1. Master node: manages the cluster. This is a single EC2 instance.
2. Core node: each cluster has one or more core nodes. They host data and run tasks.
3. Task node: can only run tasks, not store data. Needed if the application is compute-heavy. Spot instances are a good choice for it.
Cluster:
1. Transient cluster: terminates automatically.
2. Long-running cluster: requires manual termination.
@sumitmittal07 4 years ago
Nice summarization. Thanks much.
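The node roles and purchase options summarized above can be sketched as the kind of instance-group description that boto3's `run_job_flow` accepts in its `Instances` argument. This is a minimal sketch under stated assumptions, not the configuration used in the video: the helper names `instance_groups` and `cluster_instances`, the group names, and the `m5.xlarge` instance type are illustrative choices, and nothing here calls AWS.

```python
from typing import Any, Dict, List


def instance_groups(core_count: int, task_count: int = 0,
                    instance_type: str = "m5.xlarge") -> List[Dict[str, Any]]:
    """Describe master, core and (optional) task nodes as instance groups.

    The instance type is an assumed placeholder, not taken from the video.
    """
    if core_count < 1:
        # Follows the summary above: a cluster has one or more core nodes.
        raise ValueError("core_count must be at least 1")
    groups = [
        # A single master node manages the cluster.
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes host HDFS data and run tasks, so they stay on-demand.
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes only compute; a reclaimed spot instance loses no HDFS data.
        groups.append(
            {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": instance_type, "InstanceCount": task_count})
    return groups


def cluster_instances(core_count: int, task_count: int = 0,
                      transient: bool = True) -> Dict[str, Any]:
    """Build the `Instances` description for a transient or long-running cluster."""
    return {
        "InstanceGroups": instance_groups(core_count, task_count),
        # False: the cluster terminates once its steps finish (transient).
        # True: the cluster stays up until terminated manually (long-running).
        "KeepJobFlowAliveWhenNoSteps": not transient,
    }
```

Calling `cluster_instances(core_count=2, task_count=3)` describes one master, two core nodes and three spot task nodes in a cluster that shuts itself down after its steps complete.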
@amitbajpai6209 4 years ago ⁺¹
Best video to get an overall understanding of AWS EMR.
It was really helpful 😊
Kudos to the instructor!
Liked 👍 and subscribed.
Hoping for more such videos.
@sumitmittal07 4 years ago ⁺¹
Glad it was helpful!
@vairammoorthy6665 4 years ago
Best tutorial for AWS EMR
@sridharreddy9605 4 years ago
Very clear explanation, thank you for your time.
@ririraman7 2 years ago ⁺¹
Beautiful
@shilparathore8849 4 years ago ⁺¹
Very well explained, thanks for sharing
@ramprasadbandari8195 4 years ago ⁺¹
Excellent explanation and very useful info!
@sumitmittal07 4 years ago
Glad you liked it
@divakarluffy3773 3 years ago
One video resolved all my doubts. Thanks
@sumitmittal07 2 years ago ⁺¹
Happy to hear that your doubts are resolved
@dharmeswaranparamasivam5498 4 years ago
Very good session. Thanks for doing this.
@vijeandran 3 years ago
Neat explanation, and a very informative video.
@puneetnaik8719 4 years ago
Great explanation, sir. Thanks for the video.
@swaroupbanikk4444 2 years ago
BEST
@satishj801 2 years ago
At 1:14:30, he downloaded the jar from S3 but didn't copy it to HDFS the way he copied book-data.txt, yet he mentioned he is running the jar from HDFS, not from S3. It's the same step as at 1:06:52, so I'm a bit confused at that point. If someone has understood, please drop a reply.
@user-co8oc1rm5w 2 years ago
He kept the jar file in the root path of the cluster, but the file to be processed he kept in HDFS, in the directory he created named '/data'. That's why he mentioned he is running from HDFS: he downloaded the file to be processed, book-data.txt, to HDFS instead of reading it from S3. He then changed the file location in the Scala code, rebuilt the jar, placed that jar in S3, downloaded it from S3 to the master node, and executed the Spark job to process book-data.txt from HDFS, not from S3.
@satishj801 2 years ago
@@user-co8oc1rm5w Thanks for the explanation👌🏻
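The sequence described in the reply above can be sketched as the list of commands one might run on the master node. This is a rough reconstruction under assumptions, not a transcript of the video: the bucket name, jar name and main class are hypothetical parameters, and staging the data file locally before `hadoop fs -put` is just one way to get it into HDFS. The helper only builds the command lines; it does not execute anything.

```python
import shlex
from typing import List


def job_commands(bucket: str, jar: str, main_class: str,
                 data_file: str = "book-data.txt",
                 hdfs_dir: str = "/data") -> List[List[str]]:
    """Return the commands, in order, for running the job against HDFS.

    bucket, jar and main_class are hypothetical values supplied by the caller.
    """
    return [
        # 1. Download the input file from S3 to the master node's local disk.
        ["aws", "s3", "cp", f"s3://{bucket}/{data_file}", "."],
        # 2. Create the HDFS directory and copy the input file into it.
        ["hadoop", "fs", "-mkdir", "-p", hdfs_dir],
        ["hadoop", "fs", "-put", data_file, hdfs_dir],
        # 3. Download the jar from S3; it stays on the local disk, not in HDFS.
        ["aws", "s3", "cp", f"s3://{bucket}/{jar}", "."],
        # 4. Submit the local jar; the job itself reads its input from HDFS.
        ["spark-submit", "--class", main_class, jar],
    ]


def as_shell(commands: List[List[str]]) -> str:
    """Render the commands as shell lines with safe quoting."""
    return "\n".join(shlex.join(command) for command in commands)
```

The point of the sketch is the asymmetry the thread discusses: only the data file goes into HDFS, while the jar is submitted from the master node's local filesystem.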
@Naveen-xi7os 4 years ago
It was an awesome session
@fzgarcia 4 years ago
Thank you, nice presentation!
@techtransform 3 years ago
Excellent explanation :)
@diptyojha174 4 years ago ⁺¹
Very nice explanation
@sumitmittal07 4 years ago
Thank you Dipty
@dineshughade6570 1 year ago
Nice explanation. Can we have a PDF of this video?
@keyursolanki 1 year ago
Will access to S3 from the EMR cluster be allowed by default?
@amulmgr 4 years ago
Thank you very much for the video
@piby1802 4 years ago
Really nice presentation! Thank you!
@sumitmittal07 4 years ago
Glad you liked it!
@BinduReddy-n1q 1 year ago
How do I save the word count output in HDFS and also in S3?
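One possible answer to the question above, sketched in PySpark rather than the Scala used in the video: compute the counts once and call `saveAsTextFile` twice, once with an `hdfs://` path and once with an `s3://` path. The bucket name, the `/output/...` layout and the function names are assumptions for illustration, and the sketch presumes it runs on a cluster where Spark can already write to S3.

```python
from typing import Dict


def output_paths(bucket: str, job: str = "wordcount") -> Dict[str, str]:
    """Build the two output locations; the directory layout is hypothetical."""
    return {
        "hdfs": f"hdfs:///output/{job}",       # lost when the cluster terminates
        "s3": f"s3://{bucket}/output/{job}",   # survives cluster termination
    }


def run_word_count(input_path: str, bucket: str) -> None:
    """Count words in input_path and save the result to HDFS and to S3."""
    # Imported here so output_paths() can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount")
    counts = (sc.textFile(input_path)
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.cache()  # avoid recomputing the counts for the second save

    paths = output_paths(bucket)
    # saveAsTextFile fails if the target directory already exists.
    counts.saveAsTextFile(paths["hdfs"])
    counts.saveAsTextFile(paths["s3"])
    sc.stop()
```

Writing to both places costs a second pass over the cached result; if only durability matters, writing to S3 alone would be enough.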
@Dyslexic_Neuron 4 years ago
Very good explanation. Can you make a video on Spark shuffle and its issues?
@subratakr5353 4 years ago
Thanks for the lovely presentation!
I had 2 questions though:
1) When you say you are running code on the master, do you mean the namenode of the cluster? Where is the namenode for this cluster?
2) Since the data is stored in S3, does EMR copy it to HDFS and then Spark reads from HDFS eventually? In which HDFS path is the data stored?
@vijeandran 3 years ago ⁺²
Answer 1: Here the namenode, driver node, edge node and master node are all the same.
2. As soon as you create one master and 2 slave nodes, the slave nodes' hard disks act as HDFS, and Spark fetches files from these disks and processes them in the memory of the slave nodes. Here the master and slaves act as the processing unit inside AWS. S3 is also part of AWS, and it is the storage unit. When you want to process data, you copy the data file from the storage part, S3, to the processing part, HDFS, which lives on the 2 slave nodes you created. Then you can run the Spark jar file.
@priyabhatia4107 4 years ago
Great content!!
@AparnaBL 4 years ago ⁺¹
Moreover, HDFS data is ephemeral, right? If you want the data to exist even after the cluster is terminated, you can use S3.
@sumitmittal07 4 years ago
Absolutely. You can see the same thing is mentioned around the 24th minute of the session.
@AparnaBL 4 years ago ⁺²
@@sumitmittal07 Yeah, at 22:36
@rrjishan 3 years ago
As we say, on AWS we can shut down the cluster after computation and the data will be saved in S3. So are clusters only responsible for computing on the data? Isn't data also stored in clusters? It's getting a bit confusing, please clear it up.
@sancharighosh8204 3 years ago
Can you make some tutorials on Databricks?
@gaurav1825 4 years ago
Sir, please give some guidance on AWS EMR with Apache Flink and Hudi.
@anuj3922 3 years ago
The EMR cluster is billed at an hourly rate. If I don't use it, do I still have to pay for it, if I build it just for learning purposes and come back to it as my learning requires?
@fzgarcia 4 years ago
Do you know if I can run an EMR cluster like this in a free tier account?
Even if I can only run micro t3 instances in the free tier, can I manually create a cluster with a minimum of 3 micro t3 nodes, or more? Thanks.
@SpiritOfIndiaaa 4 years ago
Thanks, but why is the HDFS data gone when the cluster shuts down? Since HDFS is persistent, wouldn't it automatically be available when the cluster is up again?
@vijeandran 3 years ago
When you start the cluster you are creating three instances: one master and two datanodes. These nodes are available only for that session, because they are virtual and exist only for that session. Once you terminate the cluster, the 1 master and 2 slave instances are killed, and because of that the data in HDFS is deleted. As Sumit said, if you keep your cluster running continuously the data stays available in HDFS, but Amazon will bill more for continuous usage of the cluster.
@rajsekhargada9212 2 years ago
I think S3 is not distributed storage
@sumitmittal07 2 years ago
It's an object store, but in this scenario it's a replacement for distributed storage and serves a similar use case.
@makdan1331 4 years ago
Where is the jar file?
