AWS Glue ETL Vs EMR - Which one should I use?

  • Published on 6 Jun 2024
  • ℹ️ aws.amazon.com/emr/
    ❔www.thequestionbank.io
    ℹ️ johnnychivers.co.uk
    ☕ www.buymeacoffee.com/johnnych...
    00:00 - Intro
    00:36 - What is EMR?
    01:26 - What is AWS Glue?
    02:11 - When do I use EMR? When do I use Glue?
    In this video we look at the use cases for AWS Glue and AWS EMR. Which one to use, and when, is a common question, so I attempt to answer it in this highly requested video.
    😎 About me
    I have spent the last decade immersed in the world of big data, working as a consultant for some of the globe's biggest companies. My journey into the world of data was not the most conventional. I started my career as a performance analyst in professional sport, at the top levels of both rugby and football. I then transitioned into a career in data and computing. This journey culminated in a Master's degree in Software Development, alongside many professional certifications in AWS and MS SQL Server.
  • Howto & Style

Comments • 49

  • @nickg9650
    @nickg9650 1 year ago +19

    What a fantastic explanation - all killer, no filler. Thanks!

    • @johndanson4427
      @johndanson4427 3 months ago

      All his videos work. Is this channel in a parallel universe?

  • @jasper5016
    @jasper5016 8 months ago +1

    Wow this is a fantastic video on EMR and Glue. Thanks.

  • @kingsabru
    @kingsabru 1 year ago

    Damn. You're good. I understood the use cases for both in one swing. Thanks 🙏

  • @endpermia
    @endpermia 9 months ago +1

    Thanks for the clear explanation!

  • @SiarheiKarko
    @SiarheiKarko 1 year ago +1

    Thanks a lot Johnny, awesome explanation as always!

  • @desloubser5678
    @desloubser5678 2 years ago +1

    Thanks for the video!! It really helps a lot

  • @leoxiaoyanqu
    @leoxiaoyanqu 2 years ago +2

    Thanks a lot! I got my answer so I think it's a great video!

  • @ryanshuell
    @ryanshuell 6 months ago +1

    Excellent! Keep 'em coming!!

  • @AVISH747
    @AVISH747 7 months ago +1

    Great stuff mate. Subscribed and Liked..!

  • @AlexXavier
    @AlexXavier 2 months ago

    So clear! Thank you!

  • @jriosfer
    @jriosfer 1 year ago +1

    Thanks for the explanation! good comparison

  • @bobhaffner5902
    @bobhaffner5902 2 years ago +6

    Great job comparing the two options, Johnny

  • @shared_xp
    @shared_xp 2 months ago

    I have not heard PIG in forever, really enjoyed that language.

  • @GiasoneP
    @GiasoneP 2 years ago +5

    I don’t know how you only have ~3k subscribers. What a trove of knowledge. Thank you

    • @JohnnyChivers
      @JohnnyChivers 2 years ago +1

      Thanks for watching Jason.

    • @andregomesdasilva
      @andregomesdasilva 1 year ago +1

      Just a matter of time before he gets many more subscribers. The content is absolutely great

  • @channuangadi7504
    @channuangadi7504 10 months ago +1

    Crystal clear 🔮 explanation

  • @Alex-cn9ot
    @Alex-cn9ot 1 year ago +3

    I write almost the same code in AWS Glue as in EMR. I consume from external sources via Spark JDBC connectors and publish the results to other warehouses via JDBC. I only have crawlers to detect the intermediate files generated in the data lake at the staging or business layer, but I don't use the studio or editor.
    I find AWS Glue more integrated for managing workflows and monitoring status (CPU, RAM, etc.) than an EMR-based service.
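
The pattern described in this comment (read over JDBC, write over JDBC, same code on either service) can be sketched roughly as follows. This is a minimal illustration, not the commenter's actual job: the endpoints, table names, credentials, and the helper names `jdbc_options` and `copy_table` are all hypothetical. It assumes an existing `SparkSession` is passed in, which both a Glue Spark job and an EMR Spark step can supply.

```python
def jdbc_options(url, table, user, password, driver):
    """Collect the standard Spark JDBC data source options into one dict."""
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": driver,
    }


def copy_table(spark, source_opts, target_opts):
    """Read one table over JDBC and append it to another over JDBC.

    `spark` is an existing SparkSession, so nothing in this function is
    specific to Glue or to EMR.
    """
    df = spark.read.format("jdbc").options(**source_opts).load()
    df.write.format("jdbc").options(**target_opts).mode("append").save()
    return df


# Hypothetical endpoints and credentials, for illustration only.
source = jdbc_options(
    url="jdbc:postgresql://source.example.internal:5432/sales",
    table="public.orders",
    user="etl_reader",
    password="change-me",
    driver="org.postgresql.Driver",
)
target = jdbc_options(
    url="jdbc:postgresql://warehouse.example.internal:5432/analytics",
    table="staging.orders",
    user="etl_writer",
    password="change-me",
    driver="org.postgresql.Driver",
)
```

In a real job, `copy_table(spark, source, target)` would be called with the session the service provides, and the credentials would come from a secrets store rather than literals.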

  • @jiezhu9593
    @jiezhu9593 2 years ago +3

    I think you can request AWS to increase your quota in Glue to have more than 100 DPU enabled per glue job.

  • @whocares_today
    @whocares_today 1 month ago

    amazing work

  • @marian6040
    @marian6040 1 year ago +5

    Great explanation. How about choosing between Glue and EMR Serverless?

    • @georgeognyanov
      @georgeognyanov 1 year ago

      Was just thinking that as well. I think his points will still be valid, since EMR Serverless will be more expensive to run than self-managed EMR and we still have the case of not utilizing it fully.

  • @kaushalroonwal4279
    @kaushalroonwal4279 6 months ago +1

    Hi Johnny, since EMR Serverless is available now, do you think operational overhead is still one of the differentiators between the two? What do you recommend given EMR Serverless?

  • @hotpeppermovie
    @hotpeppermovie 2 years ago +4

    Would love if you could do a simple industry grade project starting from beginning to end!

    • @JohnnyChivers
      @JohnnyChivers 2 years ago +3

      Definitely something I can look into. What would give the most benefit? Glue or EMR? Bearing in mind it would be a very long video as it would be at industry standard.

    • @hotpeppermovie
      @hotpeppermovie 2 years ago +1

      @@JohnnyChivers Glue would be nice, since you mentioned it's easier for beginner data engineers to learn and use. But yeah, I agree it could be a whole course in itself. Perhaps you could split it up into smaller sections/videos if you do decide to do it

    • @groundingtiming
      @groundingtiming 4 months ago

      @@JohnnyChivers Hey John, great stuff. Has there been an update on this, please?

  • @jarosawsmiejczak1138
    @jarosawsmiejczak1138 2 years ago +2

    BUY THIS MAN A COFFEE. Thanks Johnny!

  • @JiyuKim-sr1mi
    @JiyuKim-sr1mi 10 months ago

    Which one is a better option when building a transactional data lake?

  • @echezonaazubike8054
    @echezonaazubike8054 11 months ago +2

    I love your Scottish accent

  • @hellorsanjeev11
    @hellorsanjeev11 1 year ago +1

    ETL code in pyspark or scala? Can I have it in Java instead?

  • @RohitPal-lz1wf
    @RohitPal-lz1wf 2 years ago +1

    I have a requirement to copy data from one DynamoDB table to another DynamoDB table within the same account. The source table is on the 2017 version while the target is on the 2019 version.
    Can you please suggest which service would fit best, with no downtime?

    • @abhijitcaps
      @abhijitcaps 1 year ago

      You can use AWS DMS in co-ordination with AWS Schema Conversion Tool

  • @drewhunt3328
    @drewhunt3328 2 years ago +1

    For data wrangling only, what are the differences between AWS Glue ETL and AWS SageMaker Data Wrangler?
    Great videos!

    • @JohnnyChivers
      @JohnnyChivers 2 years ago +1

      SageMaker Data Wrangler helps you build a workflow using pre-created libraries, mainly with the intention of using the data for ML.
      Glue ETL is where you write the code and the logic yourself from scratch. Of course, you have any Spark and Python library at your disposal. And whilst Glue ETL can run ML algos, that doesn't have to be the aim of your wrangling, unlike SageMaker Data Wrangler.

  • @alanaugust6733
    @alanaugust6733 2 years ago +1

    Does it make sense to do your proof of concept ETL code in Glue, then have EMR run that process at scale?

    • @JohnnyChivers
      @JohnnyChivers 2 years ago +2

      Hi Alan, it certainly does. The one factor to be conscious of is load. If you use a subset of data to develop your script in Glue, there could be performance issues later down the line in EMR once the full dataset is used.

    • @alan2a1l
      @alan2a1l 2 years ago

      @@JohnnyChivers Thanks, Johnny, for the response! Got it! Performance is always an issue, both in volume and composition. The test case would have to include the full range of expected inputs, but volume... well I guess you just have to run it at full volume and fix as necessary. Or parallelize with multiple Producers?...I'm sure you've dealt with it.

  • @omgleowtf
    @omgleowtf 1 year ago +2

    They now have EMR Serverless so I guess you don't need a cluster up and running 24/7 when you only need it every now and then

    • @vitaliryumshin6174
      @vitaliryumshin6174 1 year ago

      Yes, it would be interesting to get a comment from Johnny on how close those are to each other on cost.

    • @himalayasaikia5762
      @himalayasaikia5762 1 year ago

      Hey, I'm new to AWS. Just wondering: even without EMR Serverless, can't we use a transient EMR cluster that runs the job and is terminated once the job completes? That way we would not have to keep the cluster up and running.
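
A transient cluster of the kind asked about here can be sketched as a request for the EMR `RunJobFlow` API, where `KeepJobFlowAliveWhenNoSteps` set to `False` tells EMR to terminate the cluster after its last step. The helper name `transient_cluster_request`, the S3 paths, the instance types and count, and the release label are illustrative assumptions; the two role names are the EMR defaults and may differ in a given account.

```python
def transient_cluster_request(name, script_s3_uri, log_s3_uri):
    """Build keyword arguments for boto3's EMR `run_job_flow` call.

    The cluster runs a single spark-submit step and then shuts itself down,
    so nothing is billed once the job has finished.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumption: use a release you have validated
        "Applications": [{"Name": "Spark"}],
        "LogUri": log_s3_uri,
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumption: sized for illustration
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # The key setting: terminate the cluster when no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "run-etl",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default EC2 instance profile name
        "ServiceRole": "EMR_DefaultRole",      # default EMR service role name
    }


# Hypothetical bucket and script names, for illustration only.
request = transient_cluster_request(
    name="nightly-etl",
    script_s3_uri="s3://example-bucket/jobs/etl.py",
    log_s3_uri="s3://example-bucket/emr-logs/",
)
# With AWS credentials configured, the request would be submitted with:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

The cluster still takes several minutes to provision on each run, which is part of the trade-off against Glue and EMR Serverless.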

  • @shresthaditya2950
    @shresthaditya2950 11 months ago +1

    AWS Glue is built for ETL, and ETL is much easier there: get the data into the catalog, create jobs in Scala or Python, and AWS runs them without you managing clusters, infrastructure, or the Apache engine.
    EMR requires knowledge of cluster computing, so it can carry a lot of infrastructure cost.
    AWS Glue carries a 20-40% cost overhead, but 1) with Glue we pay only for what we use, since it is an on-demand service, and 2) with EMR you pay for the cluster all the time, and most companies (around 80%) have no need to run an Amazon EMR cluster.
    Glue gives you 100 DPUs (16 GB RAM and 4 CPUs per DPU).
    EMR is better than Glue for looking at data, because Glue requires a job for everything.
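
The pay-per-use versus always-on argument in this comment comes down to simple arithmetic, sketched below. All rates, unit counts, and hours are made-up placeholders rather than AWS prices, and the function names are hypothetical; substitute current pricing for your region before drawing conclusions.

```python
def monthly_cost_on_demand(rate_per_unit_hour, units, job_hours_per_month):
    """Cost of a service billed only while jobs are running."""
    return rate_per_unit_hour * units * job_hours_per_month


def monthly_cost_always_on(rate_per_node_hour, nodes, hours_in_month=730):
    """Cost of a cluster that is left running for the whole month."""
    return rate_per_node_hour * nodes * hours_in_month


def break_even_job_hours(rate_per_unit_hour, units, always_on_cost):
    """Job hours per month at which on-demand cost equals the always-on cost."""
    return always_on_cost / (rate_per_unit_hour * units)


# Placeholder figures, for illustration only.
on_demand = monthly_cost_on_demand(rate_per_unit_hour=0.5, units=10, job_hours_per_month=60)
always_on = monthly_cost_always_on(rate_per_node_hour=0.25, nodes=5)
threshold = break_even_job_hours(rate_per_unit_hour=0.5, units=10, always_on_cost=always_on)

print(f"on-demand: {on_demand}, always-on: {always_on}, break-even hours: {threshold}")
```

With these placeholder numbers the on-demand service is cheaper as long as jobs run for fewer hours per month than the break-even threshold; a heavily used cluster reverses the result.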

  • @joshi1q2w3e
    @joshi1q2w3e 10 months ago +1

    So why do people use either of these when you can just use Databricks?
    Especially EMR seems like it can be replaced by Databricks.