Please don't forget to LIKE and SUBSCRIBE! 🥺
Do you have a link to the project too? Is it on GitHub or somewhere you've committed it?
Is there a GitHub link?
Do you mind if I ask a question? I stumbled onto these project tutorials of yours and they are an absolute gem for learners and students, so thank you for that.
But do I need to have an AWS subscription in order to do these projects or not? Because I don't have money to buy one.
Thanks in advance.
May I ask what the offerings of your membership tiers are, like what perks are available for the Recruit, Corporal, or Sergeant tiers?
You know you are such a gem: amazing paid-quality work for free. I will buy you a coffee once I get a job, brother. Keep this work up!
Thanks for the kind words! Looking forward to the coffee! 😉
@@CodeWithYu sure will have one
Don't wait for a job. It will come for sure. Let's buy a good coffee now! It is only 5 bucks for such great content! I haven't found this quality of content, clarity of explanation, and consistency of perfectly reproducible code anywhere else.
Thanks Yusuf!
This is the best data engineering content I have seen on YouTube so far. Thanks for this.
Thank you! Don't forget to spread the word! ❤️
@@CodeWithYu Please can you help with a roadmap for data engineering? I'm a BI analyst wanting to transition.
Now beginning my Data Engineering journey, and this tutorial is an absolute gem! I was able to reproduce everything from A-Z and get it all running! The only glitch is that the Broker service, for some unknown reason, always exits at some point, so the vehicle never gets to the destination 😅. However, I do still get the data in S3. Thanks again for this! Hope I can add this project to my portfolio. Looking forward to the visualisation part!
Getting a Task not Serializable error while streaming data into S3. Checkpoints and data folders are being created in S3, but the data from Kafka is not getting pushed. Any idea why?
"Only glitch is the Broker service for some unknown reason always exits at some point"
Hey if it helps, removing the KAFKA_METRIC_REPORTERS and all the CONFLUENT variables helped me not letting Kafka exits : )
Thank you so much Yusuf! After some challenges here and there, I've been able to complete the project. As a newbie in data engineering, I've learned so much in this exercise and gained more confidence. Onto the next one, which is Spark unstructured streaming.
Fantastic!
Thank you so much!! You are a good teacher.
You are welcome!
Thank you very much, have a nice day.
Another amazing pick
Great job Yu. Thanks for helping humanity :)
My pleasure!
Subbed! Thanks a lot for your kindness to share this amazing wisdom and knowledge!
You’re welcome!
Your tutorials are just amazing. They make all of this stuff make sense. I would love to see one of those projects where you also use infrastructure as code, with Terraform for example. I know that's more on the DevOps side, but I had to do that at my first job as well as data engineering, and I was kinda lost for a while.
Such a great project for free, hats off to you man 🥰
You're amazing!! Keep going!!
Always inspiring with helpful content, keep up the good work.
I can make basic changes, like using AWS EMR in place of AWS Glue, and put this project on my resume and LinkedIn.
That’s another interesting angle to it! 🔥
Nice work... waiting for dbt and Snowflake 🎉🎉😊
Incoming… watch out! 😀
Thank you very much for this video, I learnt a lot from it.
Thank you for watching and learning from the video, it means a lot!
Wow...! Such great content!
Thank you very much for all your projects!
Could you please make an end-to-end project with Delta Live Tables in Databricks?
Sure thing! Don't forget to suggest this in the community section!
excellent video!
Glad you liked it!
I love your content Yusuf, but when you're doing your project videos, try not to jump around so much; there are a lot of grey areas where the code isn't explained, or the video doesn't line up with the code you posted. Please consider that; I want your channel to be one of the best. PS: I work at an edtech company, and I like to send my students to your channel.
Great job! 👏👏 Inspired!!!
I’m glad the project got you inspired!
This was an amazing tutorial! You are a badass! These end-to-end projects are very much needed! And yes, can you do how to connect to Power BI please? Thanks!
We'll see if there are more requests
If I'm not wrong, Power BI has a connector available for Redshift
Great Yusuf! Thanks a lot for another terrific contribution!
This is very helpful for me, as I want to implement a similar architecture for a project for driving schools here in Málaga.
Just wondering, how could we simulate a non-straight route between 2 points? Maybe I could get a route record (lat/long) and pass it to Kafka one record per timestamp?
I will replace the emergency topic with "pain points" where students usually fail...
That could work… another option would be an algorithm that simulates curves and bends every now and then; you could get before and after values in that case.
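For anyone who wants to try that idea, here's a rough sketch in Python. The Málaga-ish coordinates, step count, interval, and jitter size are all made-up illustrative values, not anything from the video:

import random
from datetime import datetime, timedelta

def simulate_route(start, end, steps=20, max_jitter=0.0005):
    # Walk from start to end in `steps` increments, adding a small random
    # offset to each point so the path bends instead of being a straight line.
    (lat1, lon1), (lat2, lon2) = start, end
    ts = datetime.utcnow()
    for i in range(steps + 1):
        frac = i / steps
        lat = lat1 + (lat2 - lat1) * frac + random.uniform(-max_jitter, max_jitter)
        lon = lon1 + (lon2 - lon1) * frac + random.uniform(-max_jitter, max_jitter)
        yield ts + timedelta(seconds=i * 30), lat, lon

for ts, lat, lon in simulate_route((36.7213, -4.4214), (36.7194, -4.4300)):
    print(ts.isoformat(), lat, lon)  # or produce each record to Kafka here

And if you have a real recorded route, you could skip the interpolation entirely and just replay the recorded (lat, long) points one per timestamp, exactly as suggested above.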
Great content. Thank you!!!
Hopefully I can complete this project. Plus, I'm trying to develop it using PDM and nix-shell :)
I like your T-Shirt Yusuf 😀😀😀
Haha 😀 thanks Morshed!
For setting up Spark with Docker, can you use the env variables SPARK_MODE=master / SPARK_MODE=worker instead of the command line to create the master and worker containers?
I suppose that could work, as long as the containers are not using the same key-value pairs in their env.
Thank you.
My pleasure
Hi, do I need to know every single tool to start this project? I am currently learning the tools as part of my course, but I would really like to get a project done and came across this.
You don't necessarily have to know it all; that's the whole point! You need the exposure, and to know how to pick things up from there going forward. So keep learning!
It would be really nice if you could share a link to the Kafka configuration docs so we can refer to it for an explanation of the configuration.
Apache Kafka official website is in the description
Thank you for the great video. Can somebody help me find where to copy the Docker env variables shown at 13:50?
Thank you so much
You're most welcome
Hello, I'm glad to follow your content from Latin America. My question is: what route do you recommend I study to become a software architect oriented to smart cities or physical and digital integration systems? Greetings from Venezuela.
Hi, you should choose the route that aligns with your passion
@@CodeWithYu Thank you very much for the answer, I appreciate it very much. But please, I would like a more technical answer, since my passion is architecture and programming.
I am getting an error at about 1:50:00 in the video:
ImportError: Pandas >= 1.0.5 must be installed; however, it was not found.
It turns out my spark-master container is missing packages, including pandas and pyarrow. I tried pip installing all of them, and then the error changed to something else that doesn't make sense.
Can anyone help point out what may have gone wrong?
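If anyone else hits this, one way to narrow it down is to check which Python interpreter and pandas version the executors actually see; a common cause is pip-installing into a different environment (or only on the master) than the one the workers run. A small diagnostic sketch, not from the video:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("env-check").getOrCreate()

def probe(partition):
    # Runs on the executors, so it reports their environment, not the driver's.
    import sys
    try:
        import pandas
        version = pandas.__version__
    except ImportError:
        version = "missing"
    yield (sys.executable, version)

print(spark.sparkContext.parallelize(range(2), 2).mapPartitions(probe).collect())

If the executors report "missing" or an old version, the packages need to go into the workers' Python environment (e.g., baked into the worker image), not just the master's.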
Sir, I have a doubt: can't we directly push our transformed data into the warehouse, without passing through the AWS Glue architecture?
Please tell me which platform this diagram was made with.
Hi Yu, thanks for creating this amazing content. Some questions on the Redshift part: is the data physically loaded into Redshift, or is it actually stored in Glue or the S3 bucket? Are we just using Redshift to read the data, and maybe creating another semantic layer on top of that in the next phase?
Amazing project! Could you do one for Azure also? That would be awesome, thanks a lot :)
Yes, soon!
@@CodeWithYu one for GCP too...😅
Can you share some good resources for learning PySpark? Or where you learned it?
Thanks for the content.
Is there any free alternative to DBeaver?
You can try SQL Workbench
Where did you get the data for this project?
I have a question: can I include this project in my resume, and can it help? I want to move into the data engineering domain from QA (my current role)…
Hi, can you do a video on how crypto exchanges show real-time data?
Hello Yusuf, I am not able to connect my Redshift cluster with DBeaver. Could you please tell me what the issue might be?
Most likely your VPC/firewall permissions.
I was running the project again and I can see the data in the S3 bucket. But I am not able to crawl the data; I am getting an error like account **** denied access.
@@OnkarPatole-eo5fx Create an IAM role for your Glue crawler. The IAM role should grant Glue access to S3 (S3 Full Access).
That would give the Glue crawler access to S3.
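If you'd rather set that up in code than in the console, a boto3 sketch along these lines should work. The role, crawler, database, and bucket names here are made up, and S3 Full Access is broader than you'd want in production:

import json
import boto3

iam = boto3.client("iam")
glue = boto3.client("glue")

# Trust policy letting the Glue service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(RoleName="GlueCrawlerRole",
                AssumeRolePolicyDocument=json.dumps(trust_policy))
iam.attach_role_policy(RoleName="GlueCrawlerRole",
                       PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess")
iam.attach_role_policy(RoleName="GlueCrawlerRole",
                       PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole")

glue.create_crawler(Name="smartcity-crawler",
                    Role="GlueCrawlerRole",
                    DatabaseName="smartcity_db",
                    Targets={"S3Targets": [{"Path": "s3://your-bucket/data/"}]})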
Hey Yu!! I am a college student and I am interested in the data engineering field. As a fresher, how much knowledge is enough? Like which tools, and to what level?
For the transformation, did you use PySpark?
Domain name pls
Yes, PySpark was used
How can we change this to Tableau visualizations?
You'll need to find a way to connect Tableau to the workload. However, I'm not sure Tableau and Power BI are designed for real-time streaming, so you might want to consider tools better suited for real-time visualization.
Hello Mr Yu,
I'm following your tutorial, but I'm running into issues around 1:12:23, when I try to run the whole code.
There's no way to share my code on here, but I followed you completely, so I don't know if the error is from my system. The error is that I can't seem to access Kafka; I'm always getting an error.
Hi, if you have completed around 1 hour of the video, can you please connect with me? I am running into issues at the very beginning (docker-compose.yaml). I'm a beginner and would love some assistance from a senior.
There are no entry-level fresher jobs for DE; should a fresher target data analyst roles instead?
Do you work with something like this in your current data engineering position?
Producing IoT data to Kafka is where I got stuck. It kept telling me: failed to resolve broker:29092: No such host is known. I made sure the hostname is configured correctly for the broker, but the issue is still not resolved.
Please, I need help. Thank you.
You need to run it in Docker for it to recognize your broker. Otherwise you'll need to change broker to localhost.
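In other words, which bootstrap address works depends on where the producer runs. Something like this, assuming the usual two-listener Confluent compose setup where 9092 is the port published to the host:

# Inside the Docker network (e.g., a container in the same compose file),
# the broker is reachable by its service name:
producer_config_in_docker = {"bootstrap.servers": "broker:29092"}

# From the host machine, "broker" doesn't resolve; use the listener published
# to localhost instead (9092 is an assumption based on the usual compose setup):
producer_config_on_host = {"bootstrap.servers": "localhost:9092"}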
I had a question: which arch are you using on your Mac? AMD or ARM?
aarch64 or arm64
Can you please share documentation for this project so that we can put it in our resumes?
You can get that in the source code. The details are in the video description.
Hi Yusuf, I am a beginner. Can I work with the Community version of PyCharm for these projects?
Yes, but this is still a little high-level for an absolute beginner. You may want to check out a more suitable version of this on datamasterylab.com
@@CodeWithYu Does this describe the same project in detail, in a way a beginner can understand?
Thank you, Yu, for this amazing content. I am facing an issue while submitting a Spark job and getting this error. Any help would be appreciated.
" 0 artifacts copied, 12 already retrieved (0kB/16ms)
24/04/23 22:47:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "String.lastIndexOf(String)" because "path" is null"
Hi Yusuf, Could you share the Architecture Diagram as well?
Hi, can this course be taken by someone who is a complete beginner in data engineering?
Yes, but this is still a little high-level for an absolute beginner. You may want to check out a more suitable version of this on datamasterylab.com
Hi everyone!
I'm stuck right at the end... timing out trying to connect to Redshift.
Has anyone set up the VPC, security groups and permissions properly?
I'm getting a timeout in DBeaver.
My inbound rules are set to custom protocol TCP, port 5439, any IPv4.
I set publicly accessible to enabled.
What am I missing?
Please help!
I have created the cluster again.
Before:
- VPC with 1 availability zone
- a security group for this VPC with an inbound rule for port 5439 from my IP and an all-traffic rule,
plus an outbound all-traffic rule
- cluster subnet group
- Redshift cluster with public accessibility...
and I get a timeout from DBeaver... :(
PowerBI as well...
Any suggestions please?
What this usually means is that your cluster is still not accepting connections. Have you tried testing with the default configuration? Most times, it's your configuration that's faulty.
Also, open your inbound and outbound ports, and associate the right VPC with your cluster.
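A quick way to test connectivity outside DBeaver is a few lines of Python; all the connection values below are placeholders. If this times out too, it's almost always the security group or VPC, not the credentials:

import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect(
    host="your-cluster.abc123xyz.eu-west-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="dev",
    user="awsuser",
    password="your-password",
    connect_timeout=10,  # fail fast instead of hanging
)
with conn.cursor() as cur:
    cur.execute("SELECT 1;")
    print(cur.fetchone())  # (1,) means networking and auth are both fine
conn.close()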
@@CodeWithYu oh GOD!! More than 4 hours of back and forth!
I just deleted all the VPCs in my account, created a new VPC with 2 availability zones, a default security group, and a cluster subnet group, clicking next-next xD.
Finally created my 3rd cluster and assigned the VPC. Didn't work...
But! I went to the security group again and added a new inbound rule:
custom - port 5439 - My IP... and.... TA DA!!! DBeaver successfully connected!
Thanks Yusuf!
You're welcome!
Thanks, Yu. I watched the full video and liked it. Can you make a video on an ETL pipeline using an open-source modern data stack involving DuckDB, Polars, etc.?
Yeah sure… thanks for the suggestions
Need visualization pls 😊
Hahaha… I guess we'll see if more people request it
Hi, is the full code from the video available?
In the description
Could I do this on a free tier account?
Yes.
I paid €0.40 because I made this project with some modifications and more data volume. Nearly all of it was free thanks to the AWS free tier. Cool, isn't it?
Hi, has anybody else had the "Shutdown hook called" error??? :(
Thank you for all the videos. Please can you share the source code with us?
Link to the source code is in the description
Bro is charging 5 dollars for the source code :3
You're doing a great job of providing free tutorials, and charging 5 euros for the source code would not be ideal. Think from a long-term perspective: freeCodeCamp earns a lot with only YouTube pay, so if you allow everyone to support you in this initial phase, you never know what you could earn in the long term. Restricting access by not providing the source code would definitely affect your channel's growth. This is my perspective; it's up to you. And thanks for the videos @@CodeWithYu
Can you post the dataset here please?
The good thing is you don't need a dataset; just run the code and the data gets created automatically.
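For context, the producer script synthesizes the vehicle records on the fly, roughly along these lines (a simplified sketch; the field names are illustrative rather than the exact schema from the video):

import random
import uuid
from datetime import datetime

def generate_vehicle_record(device_id):
    # Each call fabricates one telemetry reading around a base location.
    return {
        "id": str(uuid.uuid4()),
        "device_id": device_id,
        "timestamp": datetime.utcnow().isoformat(),
        "latitude": 51.5074 + random.uniform(-0.01, 0.01),
        "longitude": -0.1278 + random.uniform(-0.01, 0.01),
        "speed_kmh": random.uniform(10, 60),
    }

print(generate_vehicle_record("Vehicle-1"))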
@@CodeWithYu Btw, I don't get the Confluent part: do we have to pay a lot of money to use confluent-kafka, sir? Or did you just use the 1-month trial?
@@viethoangnguyen1264 The Docker image is free, same as confluent-kafka. But if you use Confluent Cloud, you get free credits for about a month; then you start paying if you want to continue using their services.
@@CodeWithYu But I read somewhere that the Python confluent-kafka client is not as well supported as the Java one. Is that right? And one more question: confluent-kafka is just a library like numpy and pandas, right?
Anybody else run into an issue when mounting volumes in Docker?
🤗
🥰
This is really in-depth content, but if I may offer a piece of advice: no one who already knows data engineering well is watching these videos; you have to tailor them to people who don't know much. So skipping the explanation of why you're writing certain code is not good.
Just to let you know: the first time I ran "python jobs/main.py", it immediately returned the following error: "Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (see api.version.request) (after 31ms in state APIVERSION_QUERY)".
I solved it by setting the security protocol to "PLAINTEXT" in the producer_config dict:
producer_config = {
"bootstrap.servers": KAFKA_BOOTSTRAP_SERVERS,
"error_cb": lambda err: print(f"Kafka error: {err}"),
"security.protocol": "PLAINTEXT",
}
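For anyone following along, that dict can then be passed straight to confluent-kafka's Producer; the topic name and payload here are illustrative:

from confluent_kafka import Producer

producer = Producer(producer_config)
producer.produce(
    "vehicle_data",  # illustrative topic name
    key="vehicle-1",
    value='{"speed": 42}',
    on_delivery=lambda err, msg: print(err or f"delivered to {msg.topic()}"),
)
producer.flush()  # block until the delivery callback has fired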
Please, how do I fix this error without having to switch to the confluent-kafka package?
Broken DAG: [/opt/airflow/dags/kafka-stream.py]
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.12/site-packages/kafka/record/legacy_records.py", line 50, in
from kafka.codec import (
File "/home/airflow/.local/lib/python3.12/site-packages/kafka/codec.py", line 9, in
from kafka.vendor.six.moves import range
ModuleNotFoundError: No module named 'kafka.vendor.six.moves'
Python 3.12 is a little problematic at this time. Can you try using Python 3.9 or 3.10? It should fix your errors
@@CodeWithYu I was initially running Python 3.9; I switched to 3.12 because of the error.
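For anyone who has to stay on Python 3.12: kafka-python's vendored six module is what breaks there. Two workarounds get shared around, neither from this video, so verify before relying on them. One is switching to the kafka-python-ng fork, which reportedly supports 3.12. The other is shimming the vendored module before kafka is imported (requires pip install six):

import sys
import six  # pip install six

# Register the real six.moves under the name kafka-python expects,
# *before* anything imports kafka.
sys.modules["kafka.vendor.six.moves"] = six.moves

import kafka  # now imports without ModuleNotFoundError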