End-To-End Data Engineering Project in 40 Minutes | AWS Cloud | PySpark

  • Published on 18 Dec 2024

Comments • 138

  • @shivam87480
    @shivam87480 7 months ago +17

    I used AWS Glue under the free tier and was charged after a month because I was unaware of the extra charges. Please add a note in the video about which services cost money and which can be used under the free tier; that would be very helpful for newbies like me.

  • @hlulaniwinners7076
    @hlulaniwinners7076 10 months ago +5

    That was really good to follow...100% worked and I learned so much more in 40min😀😃

    • @adityatomar9820
      @adityatomar9820 9 months ago

      I got a bill of 2.80 dollars from running a Glue ETL job just once. I don't know how I'm going to build more projects if they keep billing like this; I can't afford the fees right now. What can I do?

    • @BOSS-AI-20
      @BOSS-AI-20 3 months ago

      @adityatomar9820 Make a new free tier account.

  • @OlafKoch-j5y
    @OlafKoch-j5y 9 months ago +3

    That's what I was looking for. Thank you :)

    • @OlafKoch-j5y
      @OlafKoch-j5y 9 months ago +1

      Also, you should create a playlist with all the data engineering projects you've already done; it would make them easy to find :)

  • @rampatil-t6u
    @rampatil-t6u a month ago

    Incredible work on data engineering project! The ability to design, implement, and optimize the entire pipeline from data ingestion to processing and visualization is a true testament to your skillset. The attention to detail, efficiency in workflow, and the seamless integration of various tools and technologies are impressive. Your understanding of data architecture and best practices shines through in every step of the project. Keep up the fantastic work

  • @ravi19900
    @ravi19900 9 months ago

    Amazing content... This is the first practical AWS DE video I have watched, and I am glad I found it.
    Thank you.
    Can you please share an automated way of ingesting data into the S3 staging folder, plus a preprocessing demo followed by an SCD Type 2 implementation on Glue?

  • @TonyRydinger-bq9pk
    @TonyRydinger-bq9pk 8 months ago +4

    Great video. Can you please explain the preprocessing part? What exactly did you use to preprocess the datasets: a Python script with pandas, or something else?

  • @dggh7879
    @dggh7879 10 months ago +1

    Great project for beginners!!

  • @ganesh.majety5260
    @ganesh.majety5260 11 months ago +1

    Just watched 1 video and you gained a subscriber 🎉. Hoping for more from you 😊

  • @adityatomar9820
    @adityatomar9820 9 months ago

    Oh God! The AWS UI always overwhelmed and scared me, but you explained everything so beautifully. Thank you so much, man. I finally feel confident that I can learn AWS and build awesome projects.
    BTW, will AWS charge us for using Athena and Glue, since they don't come under the free trial?

    • @datewithdata123
      @datewithdata123 9 months ago +1

      For completing this project the bill will be less than half a dollar (if you don't run the Glue job a lot).
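
A rough way to sanity-check a Glue bill like the ones mentioned on this page: Glue ETL jobs are billed in DPU-hours, so the cost grows with the number of DPUs and the run time. The sketch below assumes a rate of 0.44 USD per DPU-hour and a one-minute billing minimum. Both values are assumptions that depend on region and Glue version, so check the current AWS Glue pricing page. The function name and parameters are illustrative only.

```python
def estimate_glue_job_cost(dpus, runtime_minutes,
                           usd_per_dpu_hour=0.44, minimum_minutes=1):
    """Estimate the cost in USD of a single Glue ETL job run.

    usd_per_dpu_hour and minimum_minutes are assumed defaults, not
    authoritative prices; look them up for your region and Glue version.
    """
    # Runs shorter than the minimum are billed as the minimum.
    billed_minutes = max(runtime_minutes, minimum_minutes)
    dpu_hours = dpus * billed_minutes / 60
    return round(dpu_hours * usd_per_dpu_hour, 4)


if __name__ == "__main__":
    # Hypothetical run: 10 DPUs for 6 minutes.
    print(estimate_glue_job_cost(dpus=10, runtime_minutes=6))
```

Crawlers and interactive sessions are billed separately and are not covered by this sketch.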

  • @sanjeevpandey2753
    @sanjeevpandey2753 11 months ago +1

    Nice one bhai, very precise and clear explanation

  • @ibrahimfadhili6621
    @ibrahimfadhili6621 3 months ago

    I love it ❤Thanks man

  • @ruben3815
    @ruben3815 11 months ago

    good job dude!

  • @swapnilgaikwad9738
    @swapnilgaikwad9738 11 months ago

    Good. Please make another end-to-end AWS data project video.

  • @avinash7003
    @avinash7003 11 months ago +11

    Can you do one on S3, Glue, EMR, Lambda, Athena, and Redshift?

    • @datewithdata123
      @datewithdata123 9 months ago +2

      Ongoing. Will be released soon

  • @binod8720
    @binod8720 a month ago +1

    The crawler failed to automatically create the table from the S3 directory where the Parquet datasets are stored. I'm not sure what happened, as I followed the exact steps, and Glue has been given access to S3 through a role created for that specific user. Any feedback on how to resolve this issue?

    • @manojk1494
      @manojk1494 a month ago

      Were you able to resolve this issue?

    • @binod8720
      @binod8720 a month ago

      @manojk1494 yep resolved, thanks 😊

  • @ajtam05
    @ajtam05 9 months ago +1

    Would you know why the 'Data preview' on joins may not populate any data, i.e. 'No data to display'? I did a sanity check, and the albums and artists files (in Excel) do indeed have matching data between artist_id (album) and id (artist).
    But when I join on those conditions, as you did, it doesn't populate any data. Just to see, I tried a right and a left join, and that actually populated data for each respective side (oddly enough).
    It seems like a glitch, because the script is simple and the join script looks correct. Do you know if the data types are converted, or if something else occurs behind the scenes, when you join in Visual ETL?

    • @ajtam05
      @ajtam05 9 months ago

      I basically can't do the project because the subsequent nodes require data being fed from previous nodes. But there's no data at the first join (album/artist). Really odd.

    • @datewithdata123
      @datewithdata123 9 months ago

      Please check that your data is in S3.

    • @himanshusaini011
      @himanshusaini011 8 months ago

      Yes, we do have the data in S3, but the same issue is popping up for me as well.

    • @danielpequeno33
      @danielpequeno33 4 months ago

      I have the same problem. Were either of you able to solve it? @himanshusaini @ajtam05

    • @iam_.satheesh9314
      @iam_.satheesh9314 5 days ago

      @ajtam05 I also faced the same issue. How do I solve it?
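
One thing worth ruling out for the empty inner join described in this thread is a key type mismatch, for example `artist_id` being read as a string on one side and `id` as a number on the other. That cause is a guess and is not confirmed anywhere in the thread. The plain-Python sketch below uses hypothetical rows, not the real dataset, and shows how such a mismatch gives an empty inner join while a left join still returns rows, which matches the reported symptom.

```python
def _index_by(rows, key):
    """Group rows by the value stored under key."""
    index = {}
    for row in rows:
        index.setdefault(row[key], []).append(row)
    return index


def inner_join(left, right, left_key, right_key):
    """Return merged rows where left[left_key] == right[right_key]."""
    index = _index_by(right, right_key)
    return [{**lrow, **rrow}
            for lrow in left
            for rrow in index.get(lrow[left_key], [])]


def left_join(left, right, left_key, right_key):
    """Like inner_join, but keep left rows that have no match."""
    index = _index_by(right, right_key)
    out = []
    for lrow in left:
        matches = index.get(lrow[left_key])
        if matches:
            out.extend({**lrow, **rrow} for rrow in matches)
        else:
            out.append(dict(lrow))
    return out


# Hypothetical sample rows: the album key is a string, the artist key an int.
albums = [{"album": "A1", "artist_id": "7"}]
artists = [{"id": 7, "name": "Some Artist"}]

print(inner_join(albums, artists, "artist_id", "id"))  # empty: "7" != 7
print(left_join(albums, artists, "artist_id", "id"))   # the album row survives

# Casting the key on one side lets the inner join match.
albums_fixed = [{**a, "artist_id": int(a["artist_id"])} for a in albums]
print(inner_join(albums_fixed, artists, "artist_id", "id"))
```

If the types do differ in your data, casting the key column before the Join node would be the equivalent fix in Glue.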

  • @kunalnkalore
    @kunalnkalore 6 months ago

    In real-world work, do we have to perform these IAM tasks and so on by hand, or do we just run Terraform scripts or something similar so that the architecture or cluster spins up? Can you clarify how this works in practice?

  • @BommineaniSai
    @BommineaniSai 2 months ago +1

    I'm facing an issue when joining Tracks with the album & artist join: it shows NO SOURCE KEY for the album & artist join condition. Can you help, please?

    • @bullshere1122
      @bullshere1122 2 months ago

      Same issue. Please help, anybody.

    • @datewithdata123
      @datewithdata123 2 months ago

      Please check that you have provided the correct join condition.

    • @iam_.satheesh9314
      @iam_.satheesh9314 5 days ago +1

      I faced that issue too. How do I solve it?

  • @TonyRydinger-bq9pk
    @TonyRydinger-bq9pk 8 months ago

    Great video. Can you let us know what you used for preprocessing? Was it a Python script with pandas, or something else?

    • @tulasipanthagani6401
      @tulasipanthagani6401 8 months ago

      Can you please help? Why is the crawler not running? It is asking for some permission. Which permission do we need to add?

  • @mushkarasaiprakash1915
    @mushkarasaiprakash1915 6 months ago

    How did you preprocess the data? What did you remove or change while preprocessing it?

  • @mwanthidaniel1254
    @mwanthidaniel1254 10 months ago +1

    Is S3 a data warehouse or data lake?

    • @datewithdata123
      @datewithdata123 10 months ago +1

      S3 is neither a warehouse nor a data lake; it's an object storage service provided by AWS, but can be used as both because it can manage large volumes of structured and unstructured data for analytics, processing, and other purposes.

  • @kukhwa
    @kukhwa 3 months ago +1

    I've followed all the steps you shared, but I can't run the crawler. It seems like the error, AccessDeniedException, is related to CloudWatch logs. So, I've added CloudWatch logs full access. But still not working. Do you have any insights?

    • @shaikgouse4u
      @shaikgouse4u 2 months ago

      I have fixed this one and created the table successfully.

    • @datewithdata123
      @datewithdata123 2 months ago

      Please grant IAM permissions to the user (administrator access or the necessary permissions).

    • @vidhyabharathi3947
      @vidhyabharathi3947 2 months ago

      @datewithdata123 I faced the same issue too. I have provided full access but am still unable to run the crawler.

    • @hemanthkumar7782
      @hemanthkumar7782 2 months ago

      @vidhyabharathi3947 I added the Glue service role and then it worked.

    • @shaikgouse4u
      @shaikgouse4u 2 months ago

      @vidhyabharathi3947 Try following the steps below:
      Fix for crawler error:
      1. Using Root user go into "AWS Glue console --> Getting Started page"
      2. Click on "Setup roles & users" option
      3. Choose your IAM User
      4. Next stage select "Grant full access to Amazon S3" --> "Read and write"
      5. Select the recommended "AWSGlueServiceRole"
      6. Review & apply changes
      7. Go to IAM console --> Access Management --> Roles. Here you'll see the role "AWSGlueServiceRole" created and assigned to IAM User selected in step-3
      8. Re-run the crawler job and it'll complete successfully.
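
The console steps above boil down to a role that the Glue service can assume, with the AWS managed policy `AWSGlueServiceRole` attached. A minimal sketch of the same idea using boto3's IAM client calls follows. The default role name is borrowed from another comment on this page, the helper name is made up, and S3 access for your bucket still has to be granted separately. The client is passed in so the function can be read and tried without AWS credentials.

```python
import json

# Trust policy that lets the AWS Glue service assume the role.
GLUE_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# AWS managed policy referred to in the steps above.
GLUE_SERVICE_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def create_glue_role(iam_client, role_name="glue_access_s3"):
    """Create a role Glue can assume and attach AWSGlueServiceRole to it.

    iam_client is expected to be boto3.client("iam") or a compatible object.
    """
    iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(GLUE_TRUST_POLICY),
    )
    iam_client.attach_role_policy(
        RoleName=role_name,
        PolicyArn=GLUE_SERVICE_POLICY_ARN,
    )
    return role_name
```

With real credentials this would be called as `create_glue_role(boto3.client("iam"))`, after which the role can be selected when configuring the crawler.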

  • @VishalPatel-bg5kt
    @VishalPatel-bg5kt 12 days ago

    As with all the others, I am facing the same issue when joining track with the first join (album & artist). The first join is unable to infer the schema, so the track_id field does not show up in the join condition when joining with track. Please provide a solution to this.

    • @iam_.satheesh9314
      @iam_.satheesh9314 5 days ago

      I faced the same issue. How do I solve it?

  • @djsamxgaming5732
    @djsamxgaming5732 6 months ago +1

    I don't know why I can't see the output in the data warehouse, even though I can see a 100% success rate in the job monitoring window. Could you tell me what the problem might be?

  • @gnaneshwaripanthagani3515
    @gnaneshwaripanthagani3515 8 months ago +2

    In AWS Glue, when I create the pipeline, the Join transform does not give me the option to select any source key. Can you please help?

    • @FredRohn
      @FredRohn 8 months ago +3

      I used infer schema, and that seemed to fix the problem for me :)

    • @gnaneshwaripanthagani3515
      @gnaneshwaripanthagani3515 8 months ago

      @FredRohn Thank you so much, it works for me.

    • @iam_.satheesh9314
      @iam_.satheesh9314 5 days ago

      @FredRohn How do I do this? Please explain.

  • @eugenia6490
    @eugenia6490 11 months ago

    Question please. 26min:38sec timestamp - you mentioned that the job created multiple blocks. Why are there multiple blocks? Thank you!

    • @datewithdata123
      @datewithdata123 11 months ago

      We created two worker nodes, and since we have very little data, there were exactly 2 files in our warehouse table.

    • @eugenia6490
      @eugenia6490 11 months ago

      @datewithdata123 Thank you!

  • @CricketLover-qy9nn
    @CricketLover-qy9nn 10 months ago +1

    I'm unable to get the track ID from the album and artist join. What might be the reason?

    • @KomalChavan-ht7wm
      @KomalChavan-ht7wm 8 months ago +1


    • @KomalChavan-ht7wm
      @KomalChavan-ht7wm 8 months ago +1

      Hey, how did you resolve this issue?

    • @FredRohn
      @FredRohn 8 months ago

      @KomalChavan-ht7wm Use infer schema; that fixed the problem for me.

    • @FredRohn
      @FredRohn 8 months ago

      try infer schema, that made it work for me

    • @danielpequeno33
      @danielpequeno33 4 months ago

      Did you find a way to solve it?

  • @vishalkanvajiya-j8t
    @vishalkanvajiya-j8t 2 months ago

    Bro, please explain the visualization part too.

  • @vidhyabharathi3947
    @vidhyabharathi3947 2 months ago

    I am unable to run the Athena query; it says it is unable to find the Parquet format.

    • @datewithdata123
      @datewithdata123 2 months ago

      Check that you have provided the correct S3 path, with the right permissions.

  • @adityatomar9820
    @adityatomar9820 9 months ago +1

    Please also explain how to push these kinds of projects to GitHub.

    • @nguyentien4711
      @nguyentien4711 4 months ago

      This procedure should not be on your GitHub. It's just a BI tool, while GitHub is the place to show your coding skills and projects built purely in code from scratch.

  • @KomalChavan-ht7wm
    @KomalChavan-ht7wm 8 months ago

    During the transform step I am unable to join the tables on a condition; no data is fetched for the columns. Can anybody help me?

  • @mkdTech369
    @mkdTech369 2 months ago

    More videos, please.

  • @badboy1585
    @badboy1585 9 months ago

    Hello bro, the services you used in this project come under the free tier, right? Or do we have to pay?

    • @datewithdata123
      @datewithdata123 9 months ago

      Some of the services are not under the free tier.
      For completing this project the bill will be less than half a dollar (if you don't run the Glue job a lot).

    • @adityatomar9820
      @adityatomar9820 9 months ago

      @datewithdata123 I got a 2.80 dollar bill just after running the ETL once in Glue.

  • @udaykirankankanala3635
    @udaykirankankanala3635 8 months ago +1

    When I try to save the visual ETL job, it shows the error "create job: access denied exception".
    Which policy do we have to add in the root account?

    • @datewithdata123
      @datewithdata123 8 months ago


    • @udaykirankankanala3635
      @udaykirankankanala3635 8 months ago

      I am unable to find that policy in the root account.
      Please help me.

    • @datewithdata123
      @datewithdata123 8 months ago +1

      Or provide IAM full access.

    • @FredRohn
      @FredRohn 8 months ago

      @udaykirankankanala3635 Did you solve this issue? I am experiencing the same thing.

    • @FredRohn
      @FredRohn 8 months ago

      @datewithdata123 How do I do this? I'm having a similar issue.

  • @shivam87480
    @shivam87480 7 months ago

    Can anyone tell me how to showcase the project on GitHub or put it on a resume?

  • @kumarsumit6117
    @kumarsumit6117 9 months ago

    Could you please help me? After successfully running the Glue pipeline, the data is not stored in the final S3 bucket.

    • @datewithdata123
      @datewithdata123 9 months ago

      Please share a screenshot of your error at datewithdata1@gmail.com

    • @rahulcp7013
      @rahulcp7013 4 months ago

      Were you able to resolve this issue? I am facing the same thing.

  • @akshaypy4117
    @akshaypy4117 7 months ago

    The crawler will not run with just S3 full access as shown here, right?

    • @datewithdata123
      @datewithdata123 7 months ago

      You may need to add IAMFullAccess if you are working as an IAM user.

    • @sidharthv1060
      @sidharthv1060 7 months ago

      @datewithdata123 I have also added IAMFullAccess to the role glue_access_s3, but the crawler failed to run again.

    • @VivekYadav-og4lt
      @VivekYadav-og4lt 7 months ago

      @sidharthv1060 I think you need to add the AWS Glue service role.

    • @supriya9047
      @supriya9047 7 months ago

      @sidharthv1060 I am also facing the same issue repeatedly, even after providing all the required access.

    • @kshitijjoshi2092
      @kshitijjoshi2092 6 months ago


  • @rahulteja4849
    @rahulteja4849 11 months ago

    While joining the tables in Visual ETL, I could not add the condition because no column names were shown.

    • @tokyochannel5684
      @tokyochannel5684 10 months ago


    • @vichitravirdwivedi
      @vichitravirdwivedi 9 months ago +1

      Refresh it multiple times. It happened to me too.

    • @datewithdata123
      @datewithdata123 9 months ago

      This may sometimes happen when you have a slow internet connection, because Glue reads the schema from the data present in S3. Hence the connection needs to be established.

    • @himanshusaini011
      @himanshusaini011 8 months ago

      @vichitravirdwivedi I already did it multiple times, but there is no output.

    • @FredRohn
      @FredRohn 8 months ago

      @himanshusaini011 Try to use infer schema; all of the fields popped up for me after doing that.

  • @Gauravsingh-hx6lw
    @Gauravsingh-hx6lw 4 months ago

    When I add the policy for Glue, it's not working. Can you help me?

    • @AshutoshParashar-u5l
      @AshutoshParashar-u5l 4 months ago

      Assign Glue access to the glue_s3_role you created, and it will work!

  • @Divya-gn5lh
    @Divya-gn5lh 4 months ago

    Hey @datewithdata, firstly, I like your project playlist. It would be helpful if you shared the source code with us. Thanks for the content.

  • @SS-gv8kh
    @SS-gv8kh 7 months ago

    When I run the Glue job it succeeds, but output files are not created in S3. Did you or anyone else face a similar issue?

    • @AshutoshParashar-u5l
      @AshutoshParashar-u5l 4 months ago

      In the visual ETL, do you see a green tick on every node? If not, then the ETL process is not complete as designed. Make sure all the nodes are green, then run it. I faced the same error, resolved it this way, and it is working as expected.
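
A quick way to confirm whether a successful job actually wrote anything is to list the objects under the target prefix. The sketch below uses the S3 `list_objects_v2` call. The helper name, bucket, and prefix are placeholders, and the client is passed in (normally `boto3.client("s3")`).

```python
def list_output_files(s3_client, bucket, prefix):
    """Return the object keys found under s3://bucket/prefix.

    Only the first page of results (up to 1000 keys) is inspected,
    which is enough for a quick existence check.
    """
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    # "Contents" is absent from the response when nothing matches.
    return [obj["Key"] for obj in response.get("Contents", [])]
```

An empty list for the job's target path suggests the job wrote to a different location or wrote nothing at all.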

  • @ajtam05
    @ajtam05 9 months ago +1

    iam:PassRole error when trying to attach the role to the project. iam:PassRole looks very confusing, but I'm not sure why no one else is encountering this issue.

    • @ajtam05
      @ajtam05 9 months ago

      User: arn:aws:iam::905418287400:user/proj is not authorized to perform: iam:PassRole on resource: arn:aws:iam::905418287400:role/glue_access_s3 because no identity-based policy allows the iam:PassRole action

    • @datewithdata123
      @datewithdata123 9 months ago +3

      At the beginning, while creating the IAM user, please add IAMFullAccess.
      This happens because the "iam:PassRole" permission is required whenever a role is passed to a service such as AWS Glue.

    • @ajtam05
      @ajtam05 9 months ago

      @datewithdata123 OK, I will try that. I tried multiple solutions with regards to creating a new policy and attaching it to the user, but no luck. Hope that works. 🙏

    • @ajtam05
      @ajtam05 9 months ago

      @datewithdata123 Yep, that worked. Thanks for that.

    • @ajtam05
      @ajtam05 9 months ago

      @datewithdata123 I believe that change has affected the way joins occur. Before, I was able to join the album & artist join with the tracks, but now the album & artist join doesn't populate any data. It looks like people have similar issues when I google it, but no solutions are provided online. Are you aware of this?
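
The error quoted in this thread says the IAM user lacks `iam:PassRole` on the Glue role. `IAMFullAccess` fixes that but grants far more than is needed; a narrower identity-based policy that only allows passing that one role to Glue is a common alternative. The sketch below builds such a policy document. The account ID and role name are copied from the error message, and the helper name is hypothetical; treat it as a starting point, not a vetted policy.

```python
import json


def pass_role_policy(account_id, role_name, service="glue.amazonaws.com"):
    """Build an identity-based policy allowing iam:PassRole for one role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": f"arn:aws:iam::{account_id}:role/{role_name}",
                # Only allow the role to be passed to the given service.
                "Condition": {
                    "StringEquals": {"iam:PassedToService": service}
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = pass_role_policy("905418287400", "glue_access_s3")
    print(json.dumps(policy, indent=2))
```

The resulting JSON can be pasted into the IAM console as an inline policy on the user.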

  • @ishwariupadhyay8122
    @ishwariupadhyay8122 11 months ago

    Can you provide your GitHub link for the data preprocessing?

    • @datewithdata123
      @datewithdata123 9 months ago

      Sorry, we didn't save the code. We used Visual ETL, so the code was auto-generated.

  • @vivekpawar3069
    @vivekpawar3069 9 months ago

    Sir, please attach the code for preprocessing the CSV file.

  • @backgrounding4821
    @backgrounding4821 7 months ago

    Hello, can you please update the Processed Data link?

    • @datewithdata123
      @datewithdata123 7 months ago


    • @backgrounding4821
      @backgrounding4821 7 months ago

      @datewithdata123 Thanks! (Y)

  • @avinash7003
    @avinash7003 11 months ago

    Please upload the Glue script.

    • @datewithdata123
      @datewithdata123 9 months ago

      Sorry, we didn't save the code. We used Visual ETL, so the code was auto-generated.

  • @HanhNguyen-sp8zo
    @HanhNguyen-sp8zo 10 months ago +1

    Is it free?