Running Spark jobs on Amazon EMR Serverless

  • Published on 5 Jan 2025

Comments

  • @viewermm1588 3 months ago

    Does anyone here know whether Spark can select/collect multiple Parquet files from an S3 bucket (all in the "ABC" folder) and combine them into one Parquet file ("DEF") in the same location? If so, what is the code? Thanks.
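
A minimal PySpark sketch of the merge asked about above, assuming the files under "ABC" share a compatible schema. The function name `combine_parquet` and the bucket paths are hypothetical. Note that Spark writes a directory: "DEF" ends up as a folder containing a single `part-*.parquet` file, not a file literally named "DEF".

```python
def combine_parquet(spark, src, dst):
    """Read every Parquet file under `src` and rewrite it as one part file under `dst`.

    `spark` is an existing SparkSession; `src` and `dst` are folder URIs,
    e.g. "s3://<bucket>/ABC/" and "s3://<bucket>/DEF/" (hypothetical paths).
    """
    # Pointing the reader at a folder loads all Parquet files inside it.
    df = spark.read.parquet(src)
    # coalesce(1) collapses the data into one partition, so Spark emits a
    # single part file. mode("overwrite") replaces anything already at `dst`.
    df.coalesce(1).write.mode("overwrite").parquet(dst)
```

In a job script, the session would typically come from `SparkSession.builder.getOrCreate()`. Funnelling everything through one partition means one task writes all the data, so this only suits data sets small enough for a single executor to handle.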

  • @AnGELsPearhead 2 years ago

    Amazing Demo!!!

  • @kingsleywen3889 2 years ago

    Amazing. Could you do a tutorial about using Step Functions with EMR Serverless? Thanks.

    • @dacort 2 years ago

      EMR Serverless is not natively supported with Step Functions today, but there is a way to do it using Lambda functions.
      We have a blog post about it here, if it's helpful! aws.amazon.com/blogs/big-data/run-a-data-processing-job-on-amazon-emr-serverless-with-aws-step-functions/
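
A rough sketch of what a Lambda function in such a workflow might look like. It is not taken from the blog post. The boto3 `emr-serverless` client and its `start_job_run` call are real, while the event keys (`application_id`, `execution_role_arn`, `entry_point`, `arguments`) and the optional `client` argument are assumptions made for illustration and testing.

```python
def build_job_run_request(event):
    """Translate a state machine input (hypothetical keys) into StartJobRun arguments."""
    return {
        "applicationId": event["application_id"],
        "executionRoleArn": event["execution_role_arn"],
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": event["entry_point"],
                "entryPointArguments": event.get("arguments", []),
            }
        },
    }


def lambda_handler(event, context, client=None):
    """Start an EMR Serverless job run and hand its ID back to the state machine."""
    if client is None:
        import boto3  # bundled with the Lambda Python runtime

        client = boto3.client("emr-serverless")
    response = client.start_job_run(**build_job_run_request(event))
    # Returning the IDs lets a later state poll GetJobRun until the job finishes.
    return {
        "application_id": event["application_id"],
        "job_run_id": response["jobRunId"],
    }
```

A second function calling `get_job_run` inside a wait-and-retry loop in the state machine would be one way to track completion.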

  • @ManishBhandari-df2xf 1 year ago

    Hi, great video. Can you please also show the steps to install external libraries on EMR, as a replacement for bootstrap scripts?

    • @dacort 1 year ago

      Assuming you're talking about EMR Serverless, there are a couple of different options. You can use custom images ( docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/application-custom-image.html ) to install OS-level dependencies. If you're just talking about PySpark dependencies, you can also bundle a virtual environment ( docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python-libraries.html ).

    • @srirajvasireddy2615 11 months ago

      For PySpark dependencies like pandas or kafka, how do I bundle a virtual environment?
      I'm new to Python; any help or suggestions are greatly appreciated.
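
A hedged sketch of the virtual-environment route mentioned above. The usual flow is to create a venv, `pip install` the libraries plus `venv-pack`, pack the venv into a `.tar.gz`, upload it to S3, and point the job at the archive through Spark properties. The helper name, the archive URI, and the `environment` alias below are assumptions; the property names follow the linked documentation page as best I know and should be checked against it.

```python
def venv_spark_submit_parameters(archive_uri, alias="environment"):
    """Build a sparkSubmitParameters string that points PySpark at a packed venv.

    `archive_uri` is the S3 location of the archive produced by venv-pack,
    e.g. "s3://<bucket>/pyspark_venv.tar.gz" (hypothetical path).
    """
    # Spark unpacks the archive into a folder named after the part following '#'.
    python_path = f"./{alias}/bin/python"
    confs = {
        "spark.archives": f"{archive_uri}#{alias}",
        "spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON": python_path,
        "spark.emr-serverless.driverEnv.PYSPARK_PYTHON": python_path,
        "spark.executorEnv.PYSPARK_PYTHON": python_path,
    }
    return " ".join(f"--conf {key}={value}" for key, value in confs.items())
```

The resulting string would go in the `sparkSubmitParameters` field of the job's `sparkSubmit` driver configuration. The archive should be built on an operating system compatible with the EMR Serverless runtime, otherwise compiled packages such as pandas may fail to import.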

  • @disrupcao4674 1 year ago

    great video

  • @bariowd 2 years ago

    Amazing video.
    Do you know if there is any way to send parameters from an Airflow DAG to the called notebook?
    For example, the DAG receives a random date and number, and when you trigger the DAG it sends those parameters to the notebook.
    Thank you! :)

    • @dacort 2 years ago

      I didn't use notebooks in this video, but the EMR StartNotebookExecution API allows you to pass parameters to notebook runs.
      We have a blog post about that here: aws.amazon.com/blogs/big-data/orchestrating-analytics-jobs-on-amazon-emr-notebooks-using-amazon-mwaa/
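
A small sketch of passing parameters through StartNotebookExecution with boto3, not taken from the blog post. The `start_notebook_execution` call and its `NotebookParams` field (a JSON-encoded string) are part of the boto3 EMR client; the helper names, IDs, role name, and the optional `client` argument are placeholders chosen for illustration.

```python
import json


def build_notebook_execution_request(editor_id, relative_path, cluster_id,
                                     service_role, params):
    """Assemble StartNotebookExecution arguments, including notebook parameters."""
    return {
        "EditorId": editor_id,
        "RelativePath": relative_path,
        "ExecutionEngine": {"Id": cluster_id, "Type": "EMR"},
        "ServiceRole": service_role,
        # The API expects the parameters as one JSON-encoded string.
        "NotebookParams": json.dumps(params),
    }


def start_notebook(request, client=None):
    """Start the notebook run and return its execution ID."""
    if client is None:
        import boto3  # requires AWS credentials at call time

        client = boto3.client("emr")
    return client.start_notebook_execution(**request)["NotebookExecutionId"]
```

In an Airflow task, `params` could be filled from the values supplied when the DAG is triggered (for example, from `dag_run.conf`), which would cover the date-and-number case described in the question.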

  • @julsgranados6861 1 year ago

    Great video! Is there any way to run a dbt project using EMR Serverless? I have seen that they have the Thrift option to connect to EMR on EC2, but I am not sure whether it is possible to connect it to EMR Serverless :(

    • @dacort 1 year ago

      Unfortunately not as of today. :(

  • @subhomoysikdar 1 year ago

    Is there a way to run EMR Serverless with GPUs? I want to run PySpark jobs with NVIDIA RAPIDS.

    • @dacort 1 year ago

      Not as of today. For that you'll still need EMR on EC2 or EMR on EKS.

    • @subhomoysikdar 1 year ago

      @dacort OK, thank you.