AWS Tutorials - ETL Pipeline with Multiple Files Ingestion in S3

  • Published Oct 4, 2024
  • The code link - github.com/aws...
    Handling multiple-file ingestion in a Glue ETL pipeline is a challenge if you want to process all the ingested files at once. Learn how to build a pipeline that can handle processing of multiple files.

Comments • 37

  • @swapnilkulkarni6719 2 years ago +3

    Really good. Thanks a lot for making such nice videos. Lots of learning from them.

  • @prakashr9221 1 year ago

    I was looking for this use case, and this is helpful. Thank you.

  • @darkcodecamp1678 3 months ago

    What we use in production: when the Glue job puts data in the raw S3 bucket, it creates an AWS SNS notification, which SQS subscribes to; then, with the help of the queue, we trigger Lambda :)
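    The SNS → SQS → Lambda chain described above could be sketched roughly as follows. This is a minimal sketch, not the video's code: the bucket/job names are hypothetical, and the boto3 call is left commented out so the event-unwrapping logic stands on its own.

    ```python
    import json

    def extract_s3_keys(sqs_event):
        """Unwrap S3 object keys from an SQS event whose messages carry
        SNS-wrapped S3 notifications (SQS body -> SNS Message -> S3 Records)."""
        keys = []
        for record in sqs_event["Records"]:               # one entry per SQS message
            sns_envelope = json.loads(record["body"])     # SQS body holds the SNS envelope
            s3_event = json.loads(sns_envelope["Message"])  # SNS Message holds the S3 event
            for s3_record in s3_event.get("Records", []):
                keys.append(s3_record["s3"]["object"]["key"])
        return keys

    def handler(event, context):
        keys = extract_s3_keys(event)
        # In a real deployment you would now kick off the downstream step, e.g.:
        # boto3.client("glue").start_job_run(JobName="process-raw-zone")  # hypothetical job name
        return {"ingested_keys": keys}
    ```

    The double `json.loads` is the part people usually trip over: SNS wraps the original S3 notification as a string inside its own envelope, and SQS wraps that again in the message body.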

  • @ladakshay 2 years ago +1

    Good, smart solution. We can also orchestrate the entire flow using Glue Workflows or Step Functions, so we don't have to depend on S3 events and Lambda.

    • @AWSTutorialsOnline 2 years ago +2

      Indeed you can. If you search my channel, I made two more videos about building pipelines using Glue Workflows and Step Functions. But some of the audience were asking about handling S3 events in the case of multiple-file ingestion, so I made this video.

    • @ladakshay 2 years ago

      @AWSTutorialsOnline Yes, this use case can come up in any pipeline where we want to trigger the next step after data is written to S3.

  • @imtiyazali7003 1 year ago

    Great info and a great tutorial. Thank you!!

  • @saravninja 2 years ago +1

    Thanks for the great explanation!

  • @arunasingh8617 2 years ago +1

    You are doing an excellent job! Get going :)

  • @misekerbirega3510 2 years ago

    Thanks a lot, Sir.

  • @udaynayak4788 1 year ago +1

    Thank you for the valuable information. Can you please cover incremental loads, where RDS is the source and Redshift the target, with the SCD2 approach? The PySpark script under Glue should handle SCD2.

    • @abhijeetjain8228 5 months ago

      That would be nice to cover!

  • @markkinuthia6178 1 year ago

    Thank you very much, Sir. I love how you teach using use cases. My question is: can this approach be used in production, and can the same also be used with Redshift? Thanks.

  • @BhanuNatva 2 months ago

    Sir, can you also please make a video on using DynamoDB?

  • @lakshminarayanau3989 2 years ago +1

    Thanks for your videos; this channel is a good learning source.
    Is there any video that talks about JSON files with multiple nested arrays (i.e., arrays within arrays), flattening them, and moving the data to Redshift?

    • @AWSTutorialsOnline 2 years ago

      I have the following videos on nested JSON; hope they help.
      th-cam.com/video/4AvBv-Rxrv4/w-d-xo.html
      th-cam.com/video/2ChiQ_2f97U/w-d-xo.html
      th-cam.com/video/PR15TVZDgy4/w-d-xo.html

  • @thegeekyreview2916 2 months ago

    What happens to the S3 data in the next run? Is it overwritten or appended?

  • @helovesdata8483 1 year ago +1

    Why write five files from the database? Is that just to show how separate files would work in this example?

  • @vivekjacobalex 2 years ago +2

    Good video 👍. I have one doubt: while pulling data from PostgreSQL to the raw folder, where did it say to write the files divided across employee records?

    • @AWSTutorialsOnline 2 years ago

      I did not. Once you choose Parquet format with Snappy compression, it does automatic partitioning based on size.

  • @deep6858 2 years ago +1

    Excellent. I am new to AWS and its services. A related question: with multiple files in S3 we trigger Lambda, and Lambda in turn calls the Glue job, and we have set the concurrency of both Lambda and the Glue job to 1. Will this work the same way or differently? Thanks.

    • @AWSTutorialsOnline 2 years ago

      Not sure about your question. But with concurrency 1 as well, the Lambda will trigger for each file upload. The only difference is that executions will queue up because of the concurrency limit.

  • @akshaybaura 2 years ago

    This is acceptable if you have control over the first Glue process, which is dumping files for you. What is the intended solution if you can't create a token/indicator file sort of thing?

  • @cloudcomputingpl8102 2 years ago

    How do you run a Glue job only on new files and not the full data? If, for example, you have 700 GB, it will take ages to run a several-hour job for every file. Can anyone point me to a resource?
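    One standard answer to this question is Glue job bookmarks, which make a job track which input files it has already processed and read only new ones on the next run. A minimal sketch, assuming a hypothetical job name; the boto3 call is left commented so the argument-building logic is self-contained:

    ```python
    def bookmark_run_args(job_name):
        """Build start_job_run kwargs that enable Glue job bookmarks, so each
        run processes only files the job has not seen before. The job's script
        must also pass a transformation_ctx to its reads for bookmarking to
        track per-source state."""
        return {
            "JobName": job_name,
            "Arguments": {
                # Tells Glue to record which input files were already processed.
                "--job-bookmark-option": "job-bookmark-enable",
            },
        }

    # In a real deployment (job name is hypothetical):
    # import boto3
    # boto3.client("glue").start_job_run(**bookmark_run_args("incremental-etl-job"))
    ```

    Bookmarks can also be enabled once on the job definition in the Glue console instead of per run; the per-run argument shown here simply overrides that setting.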

  • @tan2784 1 year ago

    Interesting. Is there an alternative way to create a single Lambda function without the token? Suppose a user doesn't have control over how data is loaded to S3, but has to work with files loaded regularly, i.e. every hour, at the level of new S3 objects.

    • @AWSTutorialsOnline 1 year ago

      There has to be some trigger to know that all files have arrived. It could be total file size or file count. You can configure an event to log all new/updated files arriving and let Lambda check their count/total size and, if the threshold is reached, trigger the pipeline.
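      The count/size threshold check described in this reply could look roughly like this. A sketch only: the threshold values, bucket, prefix, and job names are assumptions, and the AWS calls are stubbed out so the decision logic stands alone.

      ```python
      def should_trigger(objects, min_count=5, min_total_bytes=0):
          """Decide whether all expected files have landed, based on how many
          objects are present and their combined size.

          `objects` is a list of (key, size_in_bytes) pairs, e.g. built by
          paging through s3.list_objects_v2(Bucket=..., Prefix="raw/").
          """
          total = sum(size for _key, size in objects)
          return len(objects) >= min_count and total >= min_total_bytes

      def handler(event, context):
          # In a real Lambda you would list the landing prefix here, e.g.:
          # resp = boto3.client("s3").list_objects_v2(Bucket="my-raw-bucket", Prefix="raw/")
          # objects = [(o["Key"], o["Size"]) for o in resp.get("Contents", [])]
          objects = event.get("objects", [])  # stubbed for local testing
          if should_trigger(objects, min_count=5):
              # boto3.client("glue").start_job_run(JobName="process-raw-zone")  # hypothetical
              return {"triggered": True, "files": len(objects)}
          return {"triggered": False, "files": len(objects)}
      ```

      This only works when the expected count or size is known in advance; with an unpredictable number of files, the token-file approach from the video (or a Glue/EventBridge completion event) remains the more reliable signal.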

  • @sivaprasanth5961 2 years ago +1

    How can I select my state machine as the destination?

    • @AWSTutorialsOnline 2 years ago

      Sorry, I could not get your question. Can you please elaborate a bit?

  • @SandeepKumar-ne1ln 2 years ago

    Given that Glue is serverless, is it really a problem having multiple Glue jobs triggered for individual files in the raw zone?

    • @AWSTutorialsOnline 2 years ago +1

      Not really. But sometimes, when you are doing aggregation-based processing, you want all the files to land before processing. Multiple instances of Glue will also increase cost.

    • @SandeepKumar-ne1ln 2 years ago

      @AWSTutorialsOnline Another question I have is: if multiple files are being created, then instead of an S3 event triggering a Lambda function, can't we trigger Lambda on a Glue event (when the Glue job completes writing all files to S3)?

    • @AWSTutorialsOnline 2 years ago +1

      @SandeepKumar-ne1ln You can, using an EventBridge-based event. I talked about it in some other videos.
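      For reference, the EventBridge route mentioned here matches Glue's "Glue Job State Change" events. A sketch of what the event pattern could look like, with a toy matcher to illustrate the semantics; the job name is hypothetical, and in practice the pattern goes into an EventBridge rule, not into code:

      ```python
      # Pattern you would attach to an EventBridge rule so that a Lambda (or a
      # Step Functions execution) fires only when this Glue job succeeds.
      GLUE_SUCCESS_PATTERN = {
          "source": ["aws.glue"],
          "detail-type": ["Glue Job State Change"],
          "detail": {"jobName": ["dump-raw-files-job"], "state": ["SUCCEEDED"]},
      }

      def matches(pattern, event):
          """Toy version of EventBridge matching: every pattern key must be
          present in the event, nested dicts recurse, and lists mean 'any of'."""
          for key, expected in pattern.items():
              value = event.get(key)
              if isinstance(expected, dict):
                  if not isinstance(value, dict) or not matches(expected, value):
                      return False
              elif value not in expected:
                  return False
          return True
      ```

      Because the rule fires once per job run rather than once per uploaded file, it sidesteps the multiple-file problem entirely, provided you control the upstream Glue job.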