How to Build and Automate a Python ETL Pipeline with Airflow on AWS EC2 | Data Engineering Project
- Published Jun 2, 2024
- In this data engineering project, we will learn how to build and automate an ETL process that extracts current weather data from the OpenWeatherMap API, transforms it, and loads it into an S3 bucket using Apache Airflow. Apache Airflow is an open-source platform for orchestrating and scheduling workflows of tasks and data pipelines. This project will be carried out entirely on the AWS cloud platform.
We will cover the fundamental concepts of Apache Airflow, such as DAGs and Operators. I will show you how to install Apache Airflow from scratch and schedule your ETL pipeline, and I will also show you how to use a sensor in your ETL pipeline.
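For orientation, here is a minimal sketch of how a DAG, a sensor, and an operator fit together. This is an illustrative outline, not the exact code from the video: the task IDs, connection ID, and endpoint are assumptions, and it requires apache-airflow plus the HTTP provider package to be installed.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.http.sensors.http import HttpSensor  # needs apache-airflow-providers-http

def transform_load_weather_data():
    # Extract the API response, transform it, and write a CSV to S3.
    # (Body omitted here; the video walks through it step by step.)
    pass

default_args = {
    "owner": "airflow",
    "retries": 2,
    "retry_delay": timedelta(minutes=2),
}

with DAG(
    dag_id="weather_dag",                  # assumed name
    default_args=default_args,
    start_date=datetime(2024, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Sensor: poke the API until it responds, so the downstream task
    # only runs when the endpoint is actually reachable.
    is_weather_api_ready = HttpSensor(
        task_id="is_weather_api_ready",
        http_conn_id="weathermap_api",     # Airflow connection defined in the UI (assumed ID)
        endpoint="/data/2.5/weather?q=Portland&APPID=YOUR_KEY",
    )

    etl = PythonOperator(
        task_id="transform_load_weather_data",
        python_callable=transform_load_weather_data,
    )

    is_weather_api_ready >> etl    # the sensor gates the ETL task
```

A DAG file like this is essentially a workflow definition the scheduler reads; Airflow discovers it automatically once it is saved under the dags folder.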
As this is a hands-on project, I highly encourage you to first watch the video in its entirety without following along, so that you can better understand the concepts and the workflow. After that, either try to replicate the example without watching the video (consulting it when you get stuck), or watch the video a second time in its entirety while following along.
Remember the best way to learn is by doing it yourself - Get your hands dirty!
If you have any questions or comments, feel free to leave them in the comment section below.
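To make the "transform" part of this pipeline concrete before you start: OpenWeatherMap returns temperatures in Kelvin, and the project converts them before writing a CSV. A minimal, self-contained sketch of that step (the field names mirror OpenWeatherMap's JSON layout, but the exact columns you keep are up to you):

```python
def kelvin_to_fahrenheit(temp_k: float) -> float:
    """Convert a Kelvin reading from the OpenWeatherMap API to Fahrenheit."""
    return (temp_k - 273.15) * 9.0 / 5.0 + 32.0

def transform(record: dict) -> dict:
    """Flatten one API response into a row ready for a CSV/S3 load."""
    return {
        "city": record["name"],
        "description": record["weather"][0]["description"],
        "temp_f": round(kelvin_to_fahrenheit(record["main"]["temp"]), 2),
    }
```

For example, `transform({"name": "Portland", "weather": [{"description": "light rain"}], "main": {"temp": 300.0}})` yields a row with `temp_f` of 80.33.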
Books I recommend
1. Grit: The Power of Passion and Perseverance amzn.to/3EZKSgb
2. Think and Grow Rich!: The Original Version, Restored and Revised: amzn.to/3Q2K68s
3. The Book on Rental Property Investing: How to Create Wealth With Intelligent Buy and Hold Real Estate Investing: amzn.to/3LLpXRy
4. How to Invest in Real Estate: The Ultimate Beginner's Guide to Getting Started: amzn.to/48RbuOb
5. Introducing Python: Modern Computing in Simple Packages amzn.to/3Q4driR
6. Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter 3rd Edition: amzn.to/3rGF73G
**************** Commands used in this video ****************
sudo apt update
sudo apt install python3-pip
sudo apt install python3.10-venv
python3 -m venv airflow_venv
sudo pip install pandas
sudo pip install s3fs
sudo pip install apache-airflow
airflow standalone
sudo apt install awscli
aws configure
aws sts get-session-token
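One note on the commands above: `python3 -m venv airflow_venv` creates a virtual environment, but the subsequent `sudo pip install ...` commands install into the system Python, bypassing it (a commenter below points this out). A sketch of a consistent alternative, activating the venv first:

```shell
python3 -m venv airflow_venv
source airflow_venv/bin/activate      # subsequent pip/airflow commands now use the venv
pip install pandas s3fs apache-airflow
airflow standalone
```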
**************** USEFUL LINKS ****************
Extract current weather data from Open Weather Map API using python on AWS EC2: • Extract current weathe...
How to remotely SSH (connect) Visual Studio Code to AWS EC2: • How to remotely SSH (c...
PostgreSQL Playlist: • Tutorial 1 - What is D...
Weather Map API: openweathermap.org/api
Github Repo: github.com/YemiOla/data_engin...
Please don’t forget to LIKE, SHARE, COMMENT and SUBSCRIBE to our channel for more AWESOME videos.
DISCLAIMER: This video and description contain affiliate links. This means when you buy through one of these links, we will receive a small commission at no cost to you. This helps support us in continuing to make awesome and valuable content for you.
Thanks for taking the time to put this video out!
Really good tutorial. Nicely done. Looking forward to part 2!
Thanks! I'm glad you like it!
Thank you very much for making the concepts so easy to understand👌
this video is so clear and helpful. there are many airflow courses, but this video goes beyond and helps you "practice" airflow. hats off to the master and look forward to more awesome videos!!!
This was just what I was looking for! Now it's time to apply it on my own projects. Keep the good work! Big thank you from Brazil!
Thank you! I'm glad you like the video. It's always a good idea to apply the learning in another project. Good luck!
I've been looking for ETL project videos that I can follow to learn basic data engineering stuff, and finally I found yours! Thank you for this!
Awesome, thank you! I'm glad it was helpful. Please go ahead to explore other videos to take your skill to the next level.
This video has the signature of a master teacher. You introduced key concepts in a way that is simple to understand. Thank you for starting from level 1 without any assumptions of what we viewers/learners bring to the subject.
Thanks so much for this comment. It really means a lot to me. I'm glad you found it valuable.
Please bring more end-to-end content on Databricks as well. @@tuplespectra
Amazing !! Looking forward to many more
I have just released a second part to this where I showed tasks running in parallel. See link here: th-cam.com/video/DKsf88oCPWA/w-d-xo.html
This tutorial is great. Looking forward to more videos. You've got a follower.
I'm glad you love it! Thank you! And thanks for subscribing. You can also explore our playlist of Postgresql video series.
Your channel is pure gold
Thanks so much. I'm glad you find our videos valuable. We hope to continue to provide you with more awesome instructional videos.
Very detailed, basics-first tutorial with actual hands-on work recorded. No PPTs, just straightforward teaching, which is very helpful for a data engineer.
Glad it was helpful! Thanks so much for your comment.
After watching this video, I knew I had to thank you for this truly awesome video. I have learnt more from this video than from many others out there. You are amazing.
Great to hear! Thanks so much for this comment. It means a lot to me.
simple and clearly explained, THANKS !!
Thanks for your comment. I'm glad you found the video helpful.
Congratulations, good job. Part 2, don't forget!
Second part coming out soon.
idk why you are not at least as hyped as Zach Wilson. Thank you very much for giving out high-quality content for free!
Thanks so much for your comment. I'm glad you found the video helpful.
Thanks for this wonderful tuto. It’s time for me to practice now 🙏🏽
You’re welcome 😊 . Yes! Practice!
Great tutorial! Thank You.
Glad it was helpful! Thanks for your comment.
This is awesome! Thanks a lot
Thank you!
Amazing, man... would love more Airflow/DAGs/Python tutorials, and also maybe how I can use this with scraping data. Cheers!
Thanks! We have more videos on airflow you can explore. We will look into your request as well. Thanks!
Being an aspiring data engineer, this project is really helpful . Thank you so much for this content :)
You're very welcome! Thanks for your comment.
Can it run on a free tier EC2 instance?
Thank you, this was super helpful!
You're so welcome!
thank you for this tutorial
I was initially puzzled and worried whether I could grasp it all, but thanks for this video; it helps to dive into coding with airflow.
Glad it helped!
This is really good info! Thank you! One possible area to further advance this video is to upgrade the final task by loading data to an actual database (PostgreSQL for example).
Thanks so much for your comment. Means a lot to me. I agree that we can add a final task to load the data in a database. I have done that in other videos where we loaded to PostgreSQL and in another video we loaded to AWS redshift and yet in another video we loaded to snowflake. Please see my airflow playlist for several airflow projects to explore. Thanks so much. th-cam.com/video/DKsf88oCPWA/w-d-xo.html
Awesome project. Well explained
Thank you!
If all content creators use this method, showcasing their skills by creating projects, they would inspire so many. Thank you tuplespectra once again.
@@peterkatongole5984 Thanks for the comment.
Excellent information, thank you so much for posting this video here
Glad it was helpful!
Excellent presentation. Even though I'm an experienced person still I need to learn a lot from your videos. This reminds me to watch more of your other videos in future. Good work and keep it up.
Glad it was helpful! Thanks for your comment.
Thank you for this amazing, well-explained project on apache airflow. I hope you'll make a tuto on apache superset too.
I'm glad you love it! Thank you!
Helped me to upskill my knowledge, Awesome tutorial keep it up!!!
Glad it helped!
So how do we create an instance? Do we have to pay for this?
you're good. I am looking forward to the next video.
Thank you so much. I'm glad you find our videos valuable.
Simply… you're the best
Thanks so much for your comment. It made my day.
Thank you so much for the video. ❤😊
Thank you!
Excellent tutorial
Thank you!
Awesome explanation
Thanks!
Thank you very much for the amazing content. Already subscribed to your channel.
You are welcome. Awesome! Thank you for the sub!
Great video 👍👍👍👍👍
Thanks 👍
Amazing video❤
Glad you liked it!!
Thank you so much
You are clear!!
Thank you!
Thank you for this good end-to-end example. However, there is some overhead: e.g. you don't need AWS STS tokens, AWS CLI setup is enough. Also, there is no need for an IAM role for the EC2. Greetings from Croatia!
Thank you so much for the video. I just have a question: how can we make the airflow scheduler keep running after exiting SSH? I noticed that the scheduler stopped running after a while, or if you exit the connection to the EC2.
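On the scheduler question: `airflow standalone` runs in the foreground of your SSH session, so it dies when the session closes. A common workaround (not covered in the video) is to detach it, for example:

```shell
# Option 1: detach with nohup; logs go to ~/airflow.log
nohup airflow standalone > ~/airflow.log 2>&1 &

# Option 2: run it inside tmux so you can re-attach later
tmux new -s airflow     # then run `airflow standalone` and detach with Ctrl+b d
```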
Good tutorial
Thank you!
I am very grateful to see this kind of teaching. I searched a lot for a way to connect airflow to the free instance, and I found a blog on Medium that demonstrates it: (How to Install Apache Airflow on AWS EC2 Instance?)
Glad it was helpful!
Excellent vibe, I love it. Other tutorials make me sleepy.
Thanks for your comment. It really means a lot to me.
This is the bomb! Ose!
Thank you!
It will work without using the AWS access key too; I think you just had to wait a few seconds longer. I liked this project and hope to see more!
Thank you! I'm glad you find the video valuable.
Can you tell us how? I also created the role, but it does not seem to allow PutObject access, even with S3FullAccess.
Been following along with this but used GCP instead, as it's got $300 free credits for signing up. Might be worth doing that in future?
Thanks for the great content. Teaching style is brilliant. I'll definitely be checking out your other videos. All the best man.
Thanks!
This video is an absolute game-changer for anyone looking to build and automate Python ETL pipelines with Airflow on AWS EC2!
Waiting for part 2!
Second part coming out soon.
Ola, you dey try, nice one
Thank you!
Good beginner video. I am personally not a fan of running your processing code and orchestration on the same instance. How do you ensure package dependencies, manage virtual environments, and allocate resources for different workloads?
Thank you so much, my father
Thank you!
When choosing the OS for the instance, does it matter what the local machine is? I'm having error issues when trying to call the Airflow "airflow standalone" command. There are multiple errors, so I have to look at each one, but I wanted to check first whether selecting the same OS ("ubuntu") and other criteria might be the issue.
The issue was the storage size: "ec2 triggerer_job_runner.py:576} INFO - triggerer's async thread was blocked...". I thought it was a different issue because there were many error messages (like Kubernetes not installed), but after googling it I finally found out it was the storage size of the VM. That resolved it.
I followed the exact same steps this weekend, but this time I am encountering an error when I run airflow standalone. The same exact steps I followed last weekend worked flawlessly. Below is the error I am getting this time; any help would be appreciated. Error:
pydantic.errors.PydanticUserError: A non-annotated attribute was detected: `dag_id = `. All model fields require a type annotation; if `dag_id` is not meant to be a field, you may be able to resolve this error by annotating it as a `ClassVar` or updating `model_config['ignored_types']`.
I also had the same error as you.
Same for me....Same issue
I had the same error
I was able to get past the error. All I did was downgrade from pydantic 2.0 to 1.10.10. I hope it works for you.
@@bukolasalami9021 How do I downgrade from pydantic 2.0 to 1.10.10?
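For anyone hitting this, the downgrade the commenter describes is a single pip command. Run it in the same environment where Airflow is installed, then restart `airflow standalone`:

```shell
pip install "pydantic==1.10.10"
```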
my guy is 100% nigerian
I was able to open Airflow on the EC2 instance and establish an SSH connection from my Visual Studio Code. However, my DAG isn't getting created in Airflow. Can you please help?
Amazing explanation. I've tried it with an Azure VM; if anyone is trying that, the installation steps for airflow will be a little bit different.
Thank you!
I was able to do this just fine. I had 2 issues though: the t2.small didn't have enough memory and kept freezing up, so I had to use a t2.medium instance, and then it worked perfectly!
I'm glad it all went well for you. Thanks for watching the video.
Is medium available in the free tier?
@@swarupsaha9451 unfortunately no
@@swarupsaha9451 No, you will have to pay for using medium; however, it is not expensive as long as you don't leave it running extensively.
t2.small has low RAM and thus wasn't able to run airflow; one alternative solution is to use swap space.
Thank you so much. I did this on Azure, but the connection part is different from AWS and it didn't work.
@tuplespectra When I try to log in to airflow using the instance public IP and port 8080, I'm unable to connect. I have tried everything related to security groups and so on.
Hi, can I install the pandas package in the airflow container for Docker?
Great video!!! Please, can you help me with the installation of airflow? I have a few problems.
Thanks! What's the problem you got?
At 58:41 you say to save the Python file and it should sync to airflow; I've waited several minutes but don't see the DAG appearing.
How do I ensure the file is saved? Or can I see where it failed to be uploaded to airflow?
I suggest you look at your code to ensure there are no mistakes or typos, after which restart your airflow server. Let me know if this fixes your issue. Thanks for reaching out.
OK, I didn't restart my airflow server; sounds like that's where I may have gone wrong @@tuplespectra
I am getting an error at the last step, i.e. storing the CSV file in the S3 bucket. Can you please help me with that?
The log file shows: ERROR - Failed to execute job 42 for task transform_load_weather_data ([Errno 22] The provided token is malformed or otherwise invalid.; 54783)
The access key for openweather is not working when integrated with airflow, but it works fine when used without airflow. Any reason why this could be happening?
I'm guessing you are probably making a mistake somewhere. Could you watch the video again and follow it step by step? That way you might be able to see what the issue is.
It was an excellent session, no doubt. Can you please tell us how to handle multiple countries, like India, America, France, etc.?
Check out this video where I showed how to extract for multiple cities th-cam.com/video/ocFzNmgYW9o/w-d-xo.html
@@tuplespectra Thank you so much for addressing my concern :)
When I import DAG from airflow, it shows "Module airflow not found". Can you please guide me?
Great tutorial. However, can you show how not to hard-code critical information like the key ID, secret key and the rest?
Thank you! We will explore this in another video. Noted!
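Until that video exists, one common pattern is to read the keys from environment variables rather than from the DAG source. This is a sketch, not the video's code; the variable names follow the AWS convention:

```python
import os

def get_aws_credentials() -> dict:
    """Read AWS credentials from the environment instead of hard-coding
    them in the DAG file. Raises KeyError if a variable is missing,
    which fails fast rather than writing with empty credentials."""
    return {
        "key": os.environ["AWS_ACCESS_KEY_ID"],
        "secret": os.environ["AWS_SECRET_ACCESS_KEY"],
    }

# In the DAG you could then pass these to pandas/s3fs, e.g.:
# df.to_csv("s3://your-bucket/current_weather.csv",
#           storage_options=get_aws_credentials())
```

The `"key"`/`"secret"` names are what s3fs expects in `storage_options`; on EC2 an even cleaner option is an IAM instance role, which needs no keys at all.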
Hello sir,
Are there any end-to-end AWS Data Engineer projects, so that we can present them in interviews as 2 YOE?
You can work on my projects, understand them and discuss them during interviews.
MySQL vs PostgreSQL: which one should I go for as a beginner?
Actually, you can go for either one as a beginner. But since you are already getting the foundations of SQL from my channel using postgresql, I would say go with postgresql to master your SQL skills.
I want to make my airflow DAG run every hour, but it seems my token expires (the one I get with aws sts get-session-token). How can I get AWS credentials that have no expiration?
For example, on each run, how can I run aws sts get-session-token and load the result into a Python variable?
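Session tokens from `aws sts get-session-token` always expire; the durable fix is an IAM instance role (as another commenter notes further down). But to answer the literal question, you can shell out to the CLI and parse its JSON output into Python variables. A sketch, assuming the AWS CLI is configured on the instance:

```python
import json
import subprocess

def parse_session_credentials(cli_output: bytes) -> dict:
    """Extract the Credentials object from `aws sts get-session-token` output."""
    return json.loads(cli_output)["Credentials"]

def fetch_session_credentials() -> dict:
    """Call the AWS CLI and return fresh temporary credentials
    (AccessKeyId, SecretAccessKey, SessionToken, Expiration)."""
    output = subprocess.check_output(["aws", "sts", "get-session-token"])
    return parse_session_credentials(output)
```

Calling `fetch_session_credentials()` at the start of each task run gives you a fresh token each time, sidestepping the expiry problem.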
Hi
I am getting an error while copying and opening the running instance's Public IPv4 DNS (port 8080).
Error: connection timed out
Please help, thanks
If we are working in a venv, we can't use sudo.
Yay, I managed to finish it! And I have the CSV file in S3. Thanks, you deserve the like lol.
Nice work! You did it! Keep learning! Keep growing!
Hello bro, can you explain why airflow throws this DAG import error: Broken DAG: [/home/ubuntu/airflow/dags/weather_dag.py] Traceback (most recent call last):
File "", line 241, in _call_with_frames_removed
File "/home/ubuntu/airflow/dags/weather_dag.py", line 8, in
import pandas as pd
ModuleNotFoundError: No module named 'pandas'
I installed pandas with sudo pip install pandas, but the error is still there. Could you explain this? Thank you so much for your work!
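A likely cause of this ModuleNotFoundError, offered as a guess rather than a certain fix: `sudo pip` installs pandas into the system Python, which may not be the interpreter Airflow actually runs under. To check and align the two:

```shell
# See which interpreter the airflow command uses (its shebang line)
head -1 "$(which airflow)"

# Install pandas with that same interpreter, then restart the scheduler
python3 -m pip install pandas
```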
I am getting "Permission Denied: No AccessKey Presented" when I try to run my DAG.
I have created access keys in AWS, and have also pasted them into the aws_credentials variable. Please help!!
did you find a workaround for this error? i am also facing the same issue.
When I connected my VS Code to my EC2 instance, I didn't get all those files in VS Code, just the .py file. I didn't get the airflow folder or any other files. Please help!
nm i figured it out
Awesome.
Each time I trigger my DAG, it does not seem to respond; the border line around each task is always white. What might likely be the problem?
I also noticed my auto-refresh on the graph page does not stay active like yours; it is deactivated even when I try to activate it.
What EC2 instance did you use? micro, small, medium?
T3.small 2gig
@@godswillomonkhodion6393 Can you try recreating my project using medium? Maybe your EC2 was freezing. Try that and let me know what you find. Thanks!
@@tuplespectra alright. Will feedback after. Thanks
@@tuplespectra yea, this fixed the problem.
My EC2 instance is not loading using a t2.micro instance. Could that be the only reason?
That's one possible reason. It has happened to me before.
It works now. But I'm facing a new issue: my DAG is not showing up in airflow.
My load_data task failed. I configured everything right, but the last task still fails and I couldn't figure it out. Anyone with the same scenario got any solution?
If you have problems installing dependencies, it may be the venv package name: instead of sudo apt install python3.10-venv, use sudo apt install python3-venv to get the latest version. Currently it's at 3.12.
I get "Invalid operation" upon using that.
I had an error after launching airflow standalone. Can you help me please?
SqlAlchemySessionInterface.__init__() missing 6 required positional arguments: 'sequence', 'schema', 'bind_key', 'use_signer', 'permanent', and 'sid_length'
I tried pip install Flask-Session but nothing changed. I had the same error in my WSL, and there I solved it with flask-session.
How did you solve it? Can you please tell me?
I am getting the following error while creating a virtual environment.
E: Unable to locate package python3.11.9-venv
E: Couldn't find any package by glob 'python3.11.9-venv'
If anyone can help, that would be great!!
Can I add this project to my resume?
Yes absolutely! But make sure you complete the project first and ensure you understand the concepts explained.
Hi,
My DAG breaks because of "import pandas as pd". It's giving me the below error in airflow:
Broken DAG: [/home/ubuntu/airflow/dags/weather_api_etl_asp.py] Traceback (most recent call last):
File "", line 241, in _call_with_frames_removed
File "/home/ubuntu/airflow/dags/weather_api_etl_asp.py", line 7, in
import pandas as pd
ModuleNotFoundError: No module named 'pandas'
I have pandas installed on my EC2 instance.
Try sudo pip install pandas if you haven't tried it.
@@tuplespectra I tried that too; I tried uninstalling and installing again, specifically a version compatible with my Python version. Lately I tried some Python coding in Jupyter Lab/Notebook and hit the same import error. Has anyone else faced this weird import error, not just in this project but while running any .py files? Please shed some light on this.
I'm stuck at the airflow standalone command. After running it, it continuously keeps printing what looks like a conversation between the webserver and the triggerer. Please help me out.
I believe it is working fine. You just need to grab the username and password that airflow created for you and enter them on the airflow UI to sign in.
Let me know what you get.
@@tuplespectra if you could give me your Gmail I can send you some screenshots and I can explain you clearly within the mail where I am getting problem
@@tuplespectra To enter the username and password I need to start airflow first, right, by running airflow standalone? But after running airflow standalone, the airflow UI is not coming up.
@@mudasurbasha5742 tuplespectra@gmail.com
I am trying to do this project, but whenever I hit the run button in airflow for the first task (is_weather_api_ready), my SSH in VS Code starts reconnecting and airflow just loads without giving output. If anyone can help, it would be really appreciated.
What EC2 instance are you using?
@@tuplespectra Worked out finally !!
Thanks
This was my first project in Airflow and your explanation was amazing. Good work.
@@vrushalip1110 Awesome. I'm glad it worked out for you.
@@vrushalip1110 Great! Good job completing the project.
Airflow standalone gives an error saying "No response from Gunicorn master within 120 seconds".
Help. I can't remotely connect my VS Code. It says it could not establish the connection.
I have a video that explains how to do that from scratch. This is the link and let me know if it works for you. th-cam.com/video/sQQjMnEkGjs/w-d-xo.html
@@tuplespectra I followed your instructions. Then, when it comes to connecting to the host during "opening remote", it gives me: Could not establish connection to "airflow_project": The operation timed out. I'm using a Mac, btw.
@@ashraf950901 I'm not sure using a Mac is the issue. Can you make sure that the configurations in your config file are correct? Make sure the IdentityFile points to the absolute path of the .pem file, and that the host name is the current IPv4 of your running EC2 instance. Let me know what you find.
Great project! Share your LinkedIn profile.
Thank you! www.linkedin.com/in/opeyemi-olanipekun-ph-d-pmp-certified-six-sigma-black-belt-02735133/
Is anyone else unable to access the "Public IPv4 DNS" address (using the already created custom port 8080)? Mine times out and says this site can't be reached.
Never mind, I figured it out. I had to add "/login" at the end of the path (after :8080) for the Airflow login page to display.
Good job figuring it out.
@@tuplespectra Thanks! I'm pretty sure I didn't miss this part, but in the "transform_load_data" function it refers to an S3 bucket. I don't believe that was covered (unless it was in a different video). In any case, I'm having "PutObject Access Denied" issues with it. The security is a bit confusing even after looking it up; is there a video where you went over the S3 bucket, or an easy fix for the access issue? Thanks.
@@tuplespectra I was able to figure this one out as well. It looks like I had to change the bucket policy under the S3 bucket's Permissions; PutObject access was denied. I could have just put "Action": "s3:PutObject", but I put the wide-open "Action": "s3:*".
@@tuplespectra Oh, I didn't realize you went over the S3 bucket AFTER going over the Python script. Haha...I guess it works either way.
I want data for every city in a particular country. How can I do it?
Check out this video where I showed how to extract for multiple cities th-cam.com/video/ocFzNmgYW9o/w-d-xo.html
@@tuplespectra thanks man
The airflow standalone command gives the error below: TypeError: SqlAlchemySessionInterface.__init__() missing 6 required positional arguments: 'sequence', 'schema', 'bind_key', 'use_signer', 'permanent', and 'sid_length'
At what time of the video are you getting this error?
I'm also facing the same problem when I run the airflow standalone command.
It's not taking me to the airflow login page on port 8080.
Did you start the airflow server (airflow standalone) before trying to load the airflow login page?
After running 'airflow standalone' on the EC2 instance, I am getting this error: TypeError: SqlAlchemySessionInterface.__init__() missing 6 required positional arguments: 'sequence', 'schema', 'bind_key', 'use_signer', 'permanent', and 'sid_length'. Can anyone help me resolve this?
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/airflow/__main__.py", line 57, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/airflow/cli/cli_config.py", line 49, in command
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/airflow/cli/commands/standalone_command.py", line 53, in entrypoint
StandaloneCommand().run()
File "/usr/local/lib/python3.10/dist-packages/airflow/utils/providers_configuration_loader.py", line 55, in wrapped_function
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/airflow/cli/commands/standalone_command.py", line 69, in run
self.initialize_database()
File "/usr/local/lib/python3.10/dist-packages/airflow/cli/commands/standalone_command.py", line 179, in initialize_database
db.initdb()
File "/usr/local/lib/python3.10/dist-packages/airflow/utils/session.py", line 79, in wrapper
return func(*args, session=session, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/airflow/utils/db.py", line 733, in initdb
_create_db_from_orm(session=session)
File "/usr/local/lib/python3.10/dist-packages/airflow/utils/db.py", line 718, in _create_db_from_orm
_create_flask_session_tbl(engine.url)
File "/usr/local/lib/python3.10/dist-packages/airflow/utils/db.py", line 711, in _create_flask_session_tbl
db = _get_flask_db(sql_database_uri)
File "/usr/local/lib/python3.10/dist-packages/airflow/utils/db.py", line 700, in _get_flask_db
AirflowDatabaseSessionInterface(app=flask_app, db=db, table="session", key_prefix="")
TypeError: SqlAlchemySessionInterface.__init__() missing 6 required positional arguments: 'sequence', 'schema', 'bind_key', 'use_signer', 'permanent', and 'sid_length'
@@sumitrawall I am having the same problem. Have you managed to solve it?
@@gyungyoonpark No brother😔
having the same issue
same error @tupelspectra
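This `SqlAlchemySessionInterface.__init__()` TypeError reported throughout the thread is a known incompatibility between older Airflow releases and Flask-Session 0.6+, which added the `sid_length` argument. A commonly reported fix (worth trying, though not from the video) is to pin Flask-Session below 0.6 in the environment where Airflow is installed, then rerun `airflow standalone`:

```shell
pip install "Flask-Session<0.6.0"
```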
Don't ask for likes; I will only like if I can finish this and add it to my resume.
Very strict haha
I swear I always have issues with every. single. project. on youtube. Nobody in the comments has issues with any project but me, lol... so nobody is having issues with installing airflow? Really? smh
Bro, I feel the same too; I guess no one is actually trying.
@@user-ut8ln6ct4k Issues are a big part of IT projects; it's about how you overcome them.
Thanks for the tuto; just as a remark, there is no need to hard-code credentials. Since you gave the IAM role full access and it is assumed by the EC2, you can write directly to S3 from EC2.
You're welcome! Thanks for the remark.
1:38:20 - I don't get why I need aws configure INSIDE my EC2. Why do I need an access key when I'm already inside my EC2?
Thank you so much
You're most welcome