Add RDS Data Source In AWS Glue

DataEng Uncomplicated

Views: 40,964


Published on 13 Dec 2024

Comments • 88

@BeABetterDev 4 years ago ⁺²
Glue has so much depth to it. Great video!
@DataEngUncomplicated 4 years ago ⁺¹
Thank you! There are many components to AWS Glue. I will be making more videos and tutorials about Glue soon!
@fabian-manzano 3 years ago
@@DataEngUncomplicated I was also wondering: would the next steps be to add a transform node and a Data Catalog output node? I did this, but I am getting an error: An error occurred while calling o106.pyWriteDynamicFrame. ERROR: duplicate key value violates unique constraint
@DataEngUncomplicated 3 years ago
@@fabian-manzano Yes, the next steps would be to add a transform node and an output node, depending on what you are trying to do in your workflow. The error on write means you attempted to insert a record that violates a unique constraint in the target database. Perhaps you have a duplicate record in your dataset, or the key already exists in the target table.
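One way to check for the duplicate-record case before writing is to look for repeated key values in the data. The following is a minimal sketch in plain Python; the column name `id` and the sample rows are assumptions made for illustration. In a Glue Spark job, a comparable step would be converting the DynamicFrame with `toDF()` and calling `dropDuplicates(["id"])` on the resulting DataFrame.

```python
from collections import Counter


def find_duplicate_keys(rows, key):
    """Return the key values that appear more than once in rows, sorted."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def drop_duplicates(rows, key):
    """Keep the first row seen for each key value, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


# Hypothetical sample data: id 1 appears twice.
rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 1, "name": "c"},
]

print(find_duplicate_keys(rows, "id"))  # [1]
print(drop_duplicates(rows, "id"))
```

This only covers duplicates inside the batch being written. If the key already exists in the target table, the write would still fail and the job would need an upsert or a pre-delete step instead.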
@shovan3112 1 year ago ⁺¹
You are doing an amazing job simplifying things for common people who don't have an AWS background. Please keep it up. Your channel will get millions of subscribers over time for sure. Good luck, brother.
@DataEngUncomplicated 1 year ago
Thanks for the kind words, shovan!
@vierminus 2 years ago
Intro so on point, very nice 😅
@DataEngUncomplicated 2 years ago
Thanks Vierminus 😆
@imransadiq5851 1 year ago
Thank you for the superb video. I want to ask how to create a connection if my RDS SQL Server DB instance is in another AWS account, not the account where I am creating the connection.
@maheshmushyam8153 1 year ago
Can you make a video on adding the endpoint to connect a publicly accessible RDS instance with Glue?
@javiermadriz7834 5 months ago
My database is in the default VPC, but an error occurred that mentioned an S3 endpoint. Why do I need an S3 endpoint if my database is in the same VPC?
@ajinkyarajane917 2 years ago
Hi @DataEng Uncomplicated, I have a question: why did we use JDBC as the node type (data source)? Can't we directly select RDS as the node type or data source?
@DataEngUncomplicated 2 years ago
Hi ajinkya, I believe this wasn't an option when I made this video, but it appears it is now, so go ahead and use it.
@jovidog9573 7 months ago
Hello. I made a Glue job that performs ETL changes to data in an S3 bucket and exports the changed data to a Redshift database, but now I'm thinking of changing from Redshift to PostgreSQL. I know this video is about importing RDS data into Glue, but if I follow the video's instructions, would I also be able to export it back into RDS?
@DataEngUncomplicated 7 months ago
Hi, this video is only about how to add an RDS data source like Postgres to the AWS Glue Catalog. That said, once you establish your Postgres database connection, you should be able to both read from and write to it.
@iamdare 2 years ago
Hi! Great job. When setting up "access to your data store", how did you create that "ETLDEMO" instance?
@DataEngUncomplicated 2 years ago
Hi Dare, for this example I just created my Postgres instance manually in the RDS console.
@mangeshxjoshi 3 years ago
Good explanation. Does the AWS Glue ETL tool support change data capture (CDC) when loading into an RDS database? Assume S3 files are loaded initially into a PostgreSQL DB, and later incremental S3 files (delta files) are applied as updates to PostgreSQL.
Or does custom code need to be written to handle the deltas? I did not see any transformation in AWS Glue to handle CDC data.
@DataEngUncomplicated 3 years ago
Thanks Mangesh!
Glue has a job bookmark feature which keeps track of what data has been processed previously. I would look into this to see if it meets your use case. If you have bookmarks enabled, you don't need to write custom code to track progress, because Glue remembers what it has already processed and won't process it again.
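For reference, job bookmarks are switched on through the `--job-bookmark-option` job argument, and for JDBC sources the bookmark columns can be named with the `jobBookmarkKeys` and `jobBookmarkKeysSortOrder` options. The sketch below only builds those settings as dictionaries; the helper names and the `updated_at` column are assumptions for illustration. In the job script itself, each source and sink also needs a `transformation_ctx`, and the script must call `job.init(...)` and `job.commit()` for the bookmark state to be saved.

```python
def bookmark_job_arguments(enabled=True):
    """Build the Glue job argument that enables or disables job bookmarks."""
    option = "job-bookmark-enable" if enabled else "job-bookmark-disable"
    return {"--job-bookmark-option": option}


def jdbc_bookmark_options(key_columns, sort_order="asc"):
    """Build additional_options naming the bookmark key columns of a JDBC source."""
    if sort_order not in ("asc", "desc"):
        raise ValueError("sort_order must be 'asc' or 'desc'")
    return {
        "jobBookmarkKeys": list(key_columns),
        "jobBookmarkKeysSortOrder": sort_order,
    }


print(bookmark_job_arguments())
print(jdbc_bookmark_options(["updated_at"]))  # hypothetical column name
```

As far as I understand, bookmarks pick up new files or new rows rather than in-place changes to rows already processed, so applying delta files as updates to existing PostgreSQL rows would likely still need custom upsert logic.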
@chitraalavanthar3729 3 years ago ⁺¹
How would you load a partitioned table into the data lake?
@DataEngUncomplicated 3 years ago
When configuring the "data target" node that writes your data, make sure to add your partition columns to the "Partition keys" parameter.
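In a scripted Glue job, the same setting corresponds to the `partitionKeys` entry in the S3 sink's `connection_options`. The following is a minimal sketch that builds those options; the bucket path and the `year`/`month` columns are assumptions for illustration, and the commented call shows where the options would be used inside a Glue job.

```python
def s3_partitioned_sink_options(path, partition_keys):
    """Build connection_options for a partitioned S3 write from a Glue job."""
    if not path.startswith("s3://"):
        raise ValueError("path must be an s3:// URI")
    return {"path": path, "partitionKeys": list(partition_keys)}


# Hypothetical bucket and partition columns.
options = s3_partitioned_sink_options("s3://example-bucket/sales/", ["year", "month"])
print(options)

# Inside a Glue job script (not runnable locally), the options would be passed as:
# glueContext.write_dynamic_frame.from_options(
#     frame=dyf,
#     connection_type="s3",
#     connection_options=options,
#     format="parquet",
# )
```

The order of the keys determines the folder nesting, so `["year", "month"]` would produce paths of the form `year=.../month=.../`.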
@code1530 2 years ago
Awesome! Easy-to-follow instructions. One question: is it possible to crawl data from RDS with the table classification set to csv? It outputs postgresql by default.
@DataEngUncomplicated 2 years ago ⁺¹
Thanks, I'm not sure actually.
@code1530 2 years ago
Got it, bro. I want to query the crawl results in Athena without a Glue job.
@DataEngUncomplicated 2 years ago
Yeah, you don't need to use a Glue job just to crawl. You can use the Glue crawler to crawl Postgres.
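A crawler pointed at a JDBC connection can also be defined programmatically. The sketch below builds the request for the Glue `create_crawler` API with a `JdbcTargets` entry; every name in it (crawler, role ARN, catalog database, connection, include path) is a placeholder assumed for illustration, and the boto3 call is left commented because it needs AWS credentials.

```python
def jdbc_crawler_request(name, role_arn, catalog_database, connection_name, include_path):
    """Build keyword arguments for the Glue create_crawler API with a JDBC target."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": catalog_database,
        "Targets": {
            "JdbcTargets": [
                {"ConnectionName": connection_name, "Path": include_path}
            ]
        },
    }


# All values below are hypothetical placeholders.
request = jdbc_crawler_request(
    name="postgres-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    catalog_database="example_catalog_db",
    connection_name="example-rds-connection",
    include_path="etldemo/public/%",
)
print(request)

# With boto3 installed and credentials configured:
# import boto3
# boto3.client("glue").create_crawler(**request)
```

The include path follows the database/schema/table pattern, with `%` as a wildcard for all tables in the schema.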
@redolfmahlaule9893 3 years ago ⁺¹
Hi sir, after reading data from my PostgreSQL database using AWS Glue, how do I move it to S3? I will appreciate your reply.
@DataEngUncomplicated 3 years ago
You have many AWS service options to achieve this, depending on your data size and the type of data you are working with. A popular method of building data pipelines is using AWS Glue. If you want a no-code option to develop a Glue job, check out my Glue Studio overview video to learn more: th-cam.com/video/NuGqN3Aj07M/w-d-xo.html
If you code in Python and are a fan of working with pandas, another option could be leveraging the Python library AWS Data Wrangler to do this: th-cam.com/video/5pVpFnvRDW4/w-d-xo.html
@redolfmahlaule9893 3 years ago
Hi sir, can you help me trace a Glue job in X-Ray using the X-Ray daemon and SDK?
@DanielWeikert 2 years ago
This does not work for me due to VPC routing and NAT gateway issues. Do you have a video on how to configure this?
@DataEngUncomplicated 2 years ago
Hi Daniel, sorry, I don't have a video on this.
@sriramkrishnaswamy5595 1 year ago
Why do you need an AWS VPC gateway endpoint? It's a bit confusing. We are trying to connect RDS to Glue. Shouldn't the endpoint be for Glue and not S3?
@DataEngUncomplicated 1 year ago
Sorry for the delay in response.
We need an S3 VPC endpoint, rather than a Glue VPC endpoint, when configuring an RDS database because Glue stores its scripts and temporary files in an S3 bucket. Even though the Glue job connects to RDS, it still needs to access S3 for these files.
Setting up an S3 VPC endpoint provides private connectivity between the VPC and S3, without exposing the connection to the public internet. This allows Glue to securely access the S3 bucket.
Some key points:
Glue stores scripts and temp files in S3, so it needs access to S3 even if the job connects to RDS.
A VPC endpoint for S3 enables private connectivity from the VPC to S3 over the AWS network, without a public IP address.
@sumanbhattacharjee8839 2 years ago
Do you create the VPC endpoint for the S3 service or for Glue? Glue is the connection to the database, right? Do you have a video on how to create the endpoint?
@DataEngUncomplicated 2 years ago ⁺¹
Hi Suman, I created a VPC endpoint for the S3 service and not the Glue service. Sorry, I don't have a video on adding a VPC endpoint, but I will add it to my list of future videos if you think this would be helpful for others. Let me know what you think.
@sumanbhattacharjee8839 2 years ago ⁺¹
@@DataEngUncomplicated Hi, thank you for responding. I was facing an issue with this. I have an RDS MySQL database and I'm not able to connect from Glue. The DB is accessible from local tools like DBeaver. Even Lambda can connect to the same database. All the services are in the same VPC, region and security group. So an endpoint creation video could help me solve the issue...
@giancarlopoemape5041 1 year ago
@@sumanbhattacharjee8839 Hi, I'm facing the same issue. Have you resolved it?
@AronBergara 1 year ago
@@giancarlopoemape5041 On the VPC page, look for the Endpoints menu option, then create a new endpoint for S3 of type "Gateway" in the same VPC as the DB instance, associated with the route table used by the DB instance's subnet.
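The console steps above can also be scripted. The following is a minimal sketch using the EC2 `create_vpc_endpoint` API through boto3; the region, VPC ID, and route table ID are placeholders assumed for illustration. Gateway endpoints are attached to route tables, so the route table should be the one used by the subnet of the DB instance.

```python
def s3_gateway_endpoint_request(region, vpc_id, route_table_ids):
    """Build keyword arguments for EC2 create_vpc_endpoint (S3, gateway type)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(region, vpc_id, route_table_ids):
    """Create the endpoint. Requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.create_vpc_endpoint(
        **s3_gateway_endpoint_request(region, vpc_id, route_table_ids)
    )


# Hypothetical identifiers, for illustration only.
print(s3_gateway_endpoint_request("us-east-1", "vpc-0abc1234", ["rtb-0def5678"]))
```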
@gus882008 3 years ago
Hi there! Great video, but I have a question. We have a role to connect to an S3 bucket. Is it necessary for this role to have permissions in Redshift, or not? Thanks.
@DataEngUncomplicated 3 years ago
Hi Gustavo, in AWS Glue you need to set up a database connection so you can read data from Redshift. This video was specifically for RDS databases, which do not include Redshift. You will need to pass the user name and password of a Redshift user to the database connection so you can read data from this database.
@melanijagerasimovska6152 2 years ago
Great video :) I have one question: why do we add an endpoint for the S3 service and not for the RDS service (RDS is our source)?
@DataEngUncomplicated 2 years ago ⁺¹
Thanks Melanija, I could have done a better job explaining the reason in my video, and it's been so long that I forgot it. I'm going to re-create the connection to see why and get back to you.
@quinnmichael2657 2 years ago
@@DataEngUncomplicated Hey, checking in on this. We're seeing "data previews" for the source and the transformation steps, but the target (S3) is blank. Thanks in advance!
@sebasfavaron 3 years ago
I get an error code 30 when testing the connection. Could I be testing with the wrong credentials? I've run out of ideas to debug it.
@DataEngUncomplicated 3 years ago
Are you sure you have the right port number for the database?
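A quick way to rule out a wrong host or port is a plain TCP check from a machine that can reach the database. The sketch below uses only the Python standard library; the host name in the comment is a placeholder. The usual default ports are 5432 for PostgreSQL, 3306 for MySQL, and 1433 for SQL Server.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call with a hypothetical RDS endpoint:
# can_connect("mydb.example.us-east-1.rds.amazonaws.com", 5432)
```

A successful check from a laptop only confirms the port is right. Glue connects from inside the VPC, so security groups and routing can still block it there.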
@ManojKumar-vp1zj 1 year ago
Hi, my instance is not showing up in the Instance section. Can you please guide me on how to fix this? Thanks in advance.
@DataEngUncomplicated 1 year ago
Hi, are you in the same region as your instance?
@ManojKumar-vp1zj 1 year ago
@@DataEngUncomplicated Sorted, bro... You are doing amazing work. Please create more video tutorials. I watched all your videos many times over the last 2 days.
@DataEngUncomplicated 1 year ago
@@ManojKumar-vp1zj Thanks for the kind words! I'm working on it!
@kowshicnatarajan 3 years ago
Hi,
When dealing with a PostgreSQL table whose primary key is a column named "Id", it's impossible for any Glue job to reference it.
If we dig into the error log, here is the exact error:
ERROR: column "Id" does not exist
@DataEngUncomplicated 3 years ago ⁺¹
Strange, can you see the id column in the AWS Glue Catalog table?
@kowshicnatarajan 3 years ago
@@DataEngUncomplicated yes
@DataEngUncomplicated 3 years ago
That's strange, I haven't seen this before. I wonder if it's an issue with it searching for the name in lower case vs. mixed case.
@kowshicnatarajan 3 years ago
@@DataEngUncomplicated All my other column names are mixed case and I have no problem referencing them. This only occurs when the column is called "Id" and is the primary key.
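On the lower case vs. mixed case idea: PostgreSQL folds unquoted identifiers to lower case, so a column created as "Id" is only matched when the name is wrapped in double quotes. Whether that explains this particular error is not confirmed in the thread. The sketch below shows how quoted identifiers could be built for a custom query; the table and column names are assumptions for illustration.

```python
def quote_pg_identifier(name):
    """Quote an identifier for PostgreSQL, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def select_columns(table, columns):
    """Build a SELECT statement with every identifier quoted."""
    cols = ", ".join(quote_pg_identifier(c) for c in columns)
    return f"SELECT {cols} FROM {quote_pg_identifier(table)}"


# Hypothetical table and column names.
print(select_columns("Orders", ["Id", "CustomerName"]))
# SELECT "Id", "CustomerName" FROM "Orders"
```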
@aaddiis45021 3 years ago
I am not getting any instance option in the instance selection.
Edit: I used the JDBC option and was able to get it working.
@ViniciusCassalesDev 3 years ago
I have the same problem. I can't solve it yet using the RDS connection type.
@aaddiis45021 3 years ago
@@ViniciusCassalesDev Use a JDBC connection. Google the JDBC database URL format.
@ViniciusCassalesDev 3 years ago ⁺¹
@@aaddiis45021 I need to do it with an RDS connection.
@DataEngUncomplicated 3 years ago
Is your database in the same region as your Glue Catalog?
@DataEngUncomplicated 3 years ago
Is your RDS database in a VPC? If so, you will need to add a VPC endpoint.
@SafaaSelim 3 years ago
Nice video. One question: what if the RDS instance is in a different account?
@DataEngUncomplicated 3 years ago
Good question! There are two AWS Glue methods for granting cross-account access to a resource:
use a Data Catalog resource policy, or
use an IAM role.
@helovesdata8483 1 year ago
I keep getting "test connection failed" with no additional information. I created the VPC endpoint to S3 with route tables, and my VPC is publicly accessible 🤯 It's not creating the tables.
@DataEngUncomplicated 1 year ago
Hey, does your IAM role have sufficient permissions?
@helovesdata8483 1 year ago
@@DataEngUncomplicated Yes, I created a role to give Glue access to S3. I was thinking that if permissions were the issue, it would give me an error about access. It's only saying "test connection failed".
@aneeshmarathe7269 3 years ago
Amazing explanation, thank you for this. I followed the same method as yours and I am able to get the tables attached to the database using crawlers. I also tried building a Spark script using Glue Studio. However, I am still not able to connect to RDS from Glue. I have tried all possible ways to debug. Please help me out here.
@DataEngUncomplicated 3 years ago
Hi Aneesh, thanks for the comment! Was your crawler able to successfully crawl the database and find the tables? One common issue I see is that the RDS instance is usually in a VPC, so you will need to add an S3 VPC endpoint so that Glue, running inside that VPC, can still reach S3. If this is not the issue, are there any error messages that come up?
@aneeshmarathe7269 3 years ago
@@DataEngUncomplicated Thanks for your reply. Yes, my crawler was easily able to find the table associated with the database. The issue I am facing is when connecting to or migrating data from AWS Glue via Python/PySpark scripts. Below is the error I am getting:
ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): InvocationTargetException java.lang.reflect.InvocationTargetException
Exception in User Class java.lang.reflect.UndeclaredThrowableException
Caused by: java.net.ConnectException: Connection refused
@DataEngUncomplicated 3 years ago
One suggestion I have is creating a super simple Glue Studio job that reads from this database. If you can read successfully, you can rule out an issue with the VPC. If you still have an issue then, you might have a problem in your PySpark code.
@aneeshmarathe7269 3 years ago
@@DataEngUncomplicated Thank you. As per your suggestion, I tried just reading the table data from Glue Studio but ended up with the error below:
Py4JJavaError: An error occurred while calling o64.getDynamicFrame. : com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection to the host xxxxx, port 1433 has failed. Error: "Connection timed out: no further information
However, I am not facing any issues connecting to RDS using the SQL Server tool, Python, or AWS crawlers.
I don't understand what I am missing here.
@aneeshmarathe7269 3 years ago ⁺¹
@@DataEngUncomplicated Thank you for your suggestion. It was a VPC issue. It took some time to figure out, but I was able to transfer data from S3 to RDS. Thanks for all your help :)
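The thread does not say what the VPC fix was. One documented requirement for Glue connections in a VPC is that the security group used by the connection has a self-referencing inbound rule allowing all TCP ports, so Glue components can talk to each other. The sketch below builds that rule for the EC2 `authorize_security_group_ingress` API; the security group ID is a placeholder assumed for illustration, and the boto3 call is left commented because it needs AWS credentials.

```python
def self_referencing_ingress_request(security_group_id):
    """Build kwargs for EC2 authorize_security_group_ingress that allow all
    TCP traffic from members of the same security group."""
    return {
        "GroupId": security_group_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": 0,
                "ToPort": 65535,
                "UserIdGroupPairs": [{"GroupId": security_group_id}],
            }
        ],
    }


# Hypothetical security group ID.
request = self_referencing_ingress_request("sg-0123456789abcdef0")
print(request)

# With boto3 installed and credentials configured:
# import boto3
# boto3.client("ec2").authorize_security_group_ingress(**request)
```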
@AJEETKUMAR-yj8tv 1 year ago
Hi sir,
I have created a MySQL instance and a few tables with sample data, and then created a database in the Data Catalog. Now, when I try to create a connection to the database in AWS Glue, it throws an "invalid parameter" error. I am unable to fix this error, please help me fix it.
@DataEngUncomplicated 1 year ago
Hi there, try posting on AWS re:Post or contacting AWS Support with more information about your issue to see if someone can help you out!
@codingbreak8032 2 years ago
What if I want to connect it to EC2? Is it possible?
@DataEngUncomplicated 2 years ago
Hi, do you mean to a database on an EC2 machine?
@codingbreak8032 2 years ago
@@DataEngUncomplicated Yes, can AWS Glue connect to a Postgres database hosted on an EC2 instance? To be specific, the Postgres is a legacy version, 9.3.
@DataEngUncomplicated 2 years ago ⁺¹
I did a quick check for you, and yes, it's possible! You need to add a new database connection, and instead of choosing "RDS", make sure to select "JDBC" as the "connection type", and it should work!
@codingbreak8032 2 years ago
@@DataEngUncomplicated Thank you! Will try this after my vacation. Keep it up, bro!
@codingbreak8032 2 years ago
@@DataEngUncomplicated Hi, what do I input in the JDBC URL? Just the IP address of the host?
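For reference, a JDBC URL needs more than the IP address: it combines the driver prefix, host, port, and database name. The sketch below builds URLs in the standard formats for PostgreSQL, MySQL, and SQL Server; the helper name, host, and database are assumptions for illustration.

```python
DEFAULT_PORTS = {"postgresql": 5432, "mysql": 3306, "sqlserver": 1433}


def jdbc_url(engine, host, database, port=None):
    """Build a JDBC URL for the given engine, host, and database name."""
    if engine not in DEFAULT_PORTS:
        raise ValueError(f"unsupported engine: {engine}")
    port = port or DEFAULT_PORTS[engine]
    if engine == "sqlserver":
        # SQL Server uses a property list instead of a path segment.
        return f"jdbc:sqlserver://{host}:{port};databaseName={database}"
    return f"jdbc:{engine}://{host}:{port}/{database}"


# Hypothetical private IP of the EC2 host and database name.
print(jdbc_url("postgresql", "10.0.0.12", "etldemo"))
# jdbc:postgresql://10.0.0.12:5432/etldemo
```

The user name and password are entered separately in the Glue connection rather than in the URL.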
@AbdulrhmanEmad54 6 months ago
I guess you forgot to show how to make a connection in pgAdmin first.
@DataEngUncomplicated 6 months ago
Can you explain why you think you need to make a connection in pgAdmin first? I walked through how to create the database connection in the Glue Catalog.
@AbdulrhmanEmad54 6 months ago
@@DataEngUncomplicated Sorry, I am new to this, so forgive me if I am asking silly questions. Isn't the data stored locally on your computer, so you have to make a connection there first? If not, how can Glue find where it is, and how did it automatically recognize etldemo?
@AbdulrhmanEmad54 6 months ago
@@DataEngUncomplicated Oh, silly me, I got confused with data migration, my bad 😅
@chitraalavanthar3729 3 years ago
How would you load a partitioned table into the data lake?
@DataEngUncomplicated 3 years ago
There are many different ways this can be achieved. Using AWS Data Wrangler, for example, there is a parameter to specify the partition columns you want to use.
