AWS Tutorials - Using Glue Job ETL from REST API Source to Amazon S3 Bucket Destination

  • Published on 24 Jun 2020
  • Workshop link - aws-dojo.com/workshoplists/wo...
    In many scenarios, you need to build an AWS Glue job that calls a REST API to fetch data for ETL purposes. Such jobs can be configured to run on a schedule or in response to an event. The REST API could be deployed within the AWS account or outside it. In this workshop, you create an AWS Glue job that calls a REST API hosted outside the AWS account. With minor changes, you can create a job that calls APIs hosted within AWS accounts as well. A rough sketch of such a job is shown below the description.
  • Science & Technology
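
The code at the heart of the workshop boils down to a short Glue job script: call the REST API with requests and write the response to S3 with boto3. The sketch below follows that pattern; the API URL, bucket name, and region are placeholders rather than the workshop's exact values (eu-west-3 is the region the workshop uses).

  import boto3
  import requests

  API_URL = "https://jsonplaceholder.typicode.com/todos/1"  # placeholder public API
  BUCKET = "my-target-bucket"                                # placeholder bucket name
  REGION = "eu-west-3"                                       # region used in the workshop

  # Call the external REST API; the Glue connection only provides the network path,
  # the HTTP call itself is plain Python.
  response = requests.get(API_URL, timeout=30)
  response.raise_for_status()

  # Write the raw response body to the destination bucket.
  s3 = boto3.client("s3", region_name=REGION)
  s3.put_object(Body=response.text, Bucket=BUCKET, Key="mydata.txt")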

Comments • 57

  • @ganeshmaximus9604
    @ganeshmaximus9604 2 years ago +2

    Nice work! Your channel is unique. You deserve to have more subscribers.
    I have tested this with Lambda and it works great.

  • @muqeempashamohammed3394
    @muqeempashamohammed3394 3 years ago +3

    Thank you, very nice tutorial and easy to understand.

  • @savirawat6671
    @savirawat6671 1 year ago +1

    Is it possible to pull a JSON body from S3, use that JSON body to trigger a POST of the JSON data into the external API,
    and load data into my API via AWS Glue?
    Could you share some sample/reference?

  • @veerachegu
    @veerachegu 2 years ago +1

    Thanks, nice explanation.

  • @devopsdigital2834
    @devopsdigital2834 8 months ago

    How have you used the public subnet in this demo?

  • @drew4849
    @drew4849 3 years ago +2

    Hello there. I'm new to AWS. We have to create something like this at my new job. We need an ETL to extract data from an API on the internet and save the data to S3. My question is about pricing. Since we will only use the ETL once a month, what do you suggest we do about the VPC, allocated IP, and NAT that are paid per hour? Sorry if I'm misunderstanding some things; these are all pretty new to me. Your help will be greatly appreciated. Thanks.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago +4

      Hello Drew. Apologies for the late reply. There are two approaches you can take:
      1) If you already have a VPC used for other workloads, you can leverage it for this external API call.
      2) Otherwise, create a CloudFormation template which sets up everything - VPC, subnets, NAT, Glue etc. - and schedule a Lambda function which runs every month. It first creates the whole infrastructure using a CloudFormation stack, then runs the Glue job, and finally deletes the CloudFormation stack (see the sketch below).
      Hope it helps,
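
      A minimal sketch of approach 2, assuming a Lambda that uses boto3 and a pre-written template stored at a hypothetical S3 URL (the stack name, template URL, and job name are placeholders):

      import time
      import boto3

      cfn = boto3.client("cloudformation")
      glue = boto3.client("glue")

      STACK_NAME = "monthly-etl-stack"                                   # placeholder
      TEMPLATE_URL = "https://example-bucket.s3.amazonaws.com/etl.yaml"  # placeholder
      JOB_NAME = "rest-api-to-s3-job"                                    # placeholder

      def lambda_handler(event, context):
          # 1) Stand up the VPC/NAT/Glue infrastructure from the template.
          cfn.create_stack(StackName=STACK_NAME, TemplateURL=TEMPLATE_URL,
                           Capabilities=["CAPABILITY_NAMED_IAM"])
          cfn.get_waiter("stack_create_complete").wait(StackName=STACK_NAME)

          # 2) Run the Glue job and poll until it reaches a terminal state.
          run_id = glue.start_job_run(JobName=JOB_NAME)["JobRunId"]
          while True:
              state = glue.get_job_run(JobName=JOB_NAME, RunId=run_id)["JobRun"]["JobRunState"]
              if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
                  break
              time.sleep(30)

          # 3) Tear everything down so the per-hour resources (NAT, EIP) stop billing.
          cfn.delete_stack(StackName=STACK_NAME)
          return {"jobRunState": state}

      Note that a single Lambda invocation is capped at 15 minutes, so for longer-running jobs a Step Functions workflow is a better orchestrator for the same three steps.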

    • @drew4849
      @drew4849 3 years ago

      @@AWSTutorialsOnline hey, thanks for the reply!

    • @alwayssporty8102
      @alwayssporty8102 1 year ago

      Hey, maybe I'm a couple of years late, but we have the exact same requirement and I'm quite new to some AWS services. Did you implement that?

  • @sodiqafolayan4921
    @sodiqafolayan4921 3 years ago +1

    Hello, after completing the workshop I tried to build it with CloudFormation. However, I keep getting "Validation for connection properties failed" (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException). Below is my Glue connection resource code. Can you advise on how to correct this?
    MyGlueConnection:
      Type: AWS::Glue::Connection
      Properties:
        CatalogId: !Ref AWS::AccountId
        ConnectionInput:
          ConnectionProperties:
            JDBC_CONNECTION_URL: jdbc:dummy:80/dev
          ConnectionType: JDBC
          PhysicalConnectionRequirements:
            AvailabilityZone: us-east-1b
            SubnetId: !Ref GlueJobPrivateSubnet

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago +1

      Hi, I think I replied to you a few days back but I am not sure what happened to that comment. I would recommend not using this fake JDBC method. AWS has now come up with a connection type of "Network". It enables connection to the internet as long as the subnet has internet access - see the sketch below. Hope it helps.
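
      For reference, a rough sketch of creating such a Network-type connection with boto3 (the connection name, subnet, security group, and AZ are placeholders); the same resource can be declared in CloudFormation with ConnectionType: NETWORK instead of the dummy JDBC settings:

      import boto3

      glue = boto3.client("glue")

      glue.create_connection(
          ConnectionInput={
              "Name": "my-network-connection",               # placeholder name
              "ConnectionType": "NETWORK",                    # no JDBC URL needed
              "ConnectionProperties": {},
              "PhysicalConnectionRequirements": {
                  "AvailabilityZone": "us-east-1b",
                  "SubnetId": "subnet-0123456789abcdef0",     # placeholder subnet
                  "SecurityGroupIdList": ["sg-0123456789abcdef0"],  # placeholder SG
              },
          }
      )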

  • @luisg6965
      @luisg6965 3 years ago +1

      Hello, I get an error "Error 110 time out". Can you help me?
      Thanks for the video.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago +1

      It seems you have not configured the Glue connection properly. This workshop is a little old, from before the Glue network connection feature was available. Try this newer workshop, which uses a network connection. Hope it also solves your problem.
      aws-dojo.com/workshoplists/workshoplist26/

  • @veerachegu
    @veerachegu 2 years ago +1

    Is the same process applicable when I am pulling an on-prem Airtable into Redshift via an API?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      Yes, the same process works for any API-level integration.

  • @josemanuelgutierrez4095
    @josemanuelgutierrez4095 1 year ago +1

    Do you have any videos about integrating Apigee with AWS?

  • @gowthamavinash9
    @gowthamavinash9 3 years ago +2

    Thank you for the tutorial. I am new to Glue and have a question. Is it possible to insert the data directly into the database instead of storing it in S3, in the same script?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago +2

      Yes, it is possible. You can use AWS Data Wrangler to populate a dataframe with the API output and then use the AWS Data Wrangler RDS methods (to_sql) to write to the RDS database. Have a look at this link - github.com/awslabs/aws-data-wrangler
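
      A rough sketch of that idea, assuming a MySQL-flavoured RDS instance, an existing Glue Catalog connection for it, and placeholder API/table names:

      import awswrangler as wr
      import pandas as pd
      import requests

      API_URL = "https://example.com/api/records"                   # placeholder API endpoint

      # Pull the API output into a pandas dataframe.
      df = pd.DataFrame(requests.get(API_URL, timeout=30).json())

      # Write straight into the RDS database through a Glue Catalog connection,
      # skipping S3 entirely.
      con = wr.mysql.connect(connection="my-rds-glue-connection")    # placeholder connection name
      wr.mysql.to_sql(df=df, con=con, schema="mydb", table="api_records", mode="append")
      con.close()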

    • @gowthamavinash9
      @gowthamavinash9 3 years ago

      @@AWSTutorialsOnline Thank you for the response. I am trying to connect to a SQL Server database but receive a "pyodbc module not found" error even after I pointed to the .whl files for AWS Data Wrangler and pyodbc. Is it possible to connect to a SQL Server database with a Python shell job in Glue without using pip install pyodbc?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      @@gowthamavinash9 Where is your SQL Server located?

    • @gowthamavinash9
      @gowthamavinash9 3 years ago

      @@AWSTutorialsOnline It is an RDS database. I am able to connect to the DB and import the tables using a crawler.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      @@gowthamavinash9 Ah, OK. Then your job is very simple. Using PySpark itself, you can read data from SQL and write back to it; you don't need any other Python module (a rough sketch follows below). Please have a look at these two labs related to Redshift and RDS, they will give you an idea of what to do. aws-dojo.com/workshoplists/workshoplist33/
      aws-dojo.com/workshoplists/workshoplist30/
      Hope it helps. If you have more questions, please feel free to reach out to me.
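
      A rough sketch of the PySpark route for an RDS SQL Server inside a Glue Spark job; the JDBC endpoint, credentials, and table names are placeholders:

      from awsglue.context import GlueContext
      from pyspark.context import SparkContext

      glue_context = GlueContext(SparkContext.getOrCreate())
      spark = glue_context.spark_session

      jdbc_url = "jdbc:sqlserver://my-rds-endpoint:1433;databaseName=mydb"   # placeholder
      props = {"user": "my_user", "password": "my_password"}                 # placeholder

      # Read the source table, transform as needed, then write to the target table.
      df = spark.read.jdbc(url=jdbc_url, table="dbo.source_table", properties=props)
      df.write.jdbc(url=jdbc_url, table="dbo.target_table", mode="append", properties=props)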

  • @toshitmavle1707
    @toshitmavle1707 3 years ago +1

    Hi,
    Can you give a code example to call a POST API instead of GET?
    We have a requirement to call a couple of external REST POST endpoints:
    1. OAuth2 API
    2. Service API

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      Does the code below help?
      import json
      # botocore.vendored.requests only exists on older Lambda runtimes; on newer
      # runtimes, bundle the 'requests' library with the function instead.
      from botocore.vendored import requests

      def lambda_handler(event, context):
          # TODO Change URL to the private API URL
          url = 'rest api url'
          r = requests.post(url)
          return {
              'statusCode': 200,
              'body': json.dumps(r.text)
          }
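
      For the OAuth2 + service API case in the question, a rough sketch (the endpoints, credentials, and payload are placeholders) is to POST for a token first and then call the service API with it:

      import requests

      TOKEN_URL = "https://auth.example.com/oauth2/token"     # placeholder OAuth2 endpoint
      SERVICE_URL = "https://api.example.com/v1/resource"     # placeholder service endpoint

      # 1) OAuth2 API: exchange client credentials for an access token.
      token_resp = requests.post(
          TOKEN_URL,
          data={"grant_type": "client_credentials",
                "client_id": "my-client-id",                  # placeholder credentials
                "client_secret": "my-client-secret"},
          timeout=30,
      )
      access_token = token_resp.json()["access_token"]

      # 2) Service API: POST the payload with the bearer token.
      service_resp = requests.post(
          SERVICE_URL,
          headers={"Authorization": "Bearer " + access_token},
          json={"key": "value"},                              # placeholder payload
          timeout=30,
      )
      print(service_resp.status_code, service_resp.text)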

  • @hsz7338
    @hsz7338 3 years ago +1

    Thank you for the tutorial. I have a question on what connection type we should use if we want to connect to an external Kafka such as Confluent Cloud Kafka.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      I have never worked with Kafka, but I think for external Kafka you can use a Network-type connection, because all you need is an outbound network connection to the external system. Please let me know if it helps.

    • @hsz7338
      @hsz7338 3 years ago

      @@AWSTutorialsOnline Thank you. That is what I am thinking of testing out, because the Kafka connection type works for AWS MSK and I am not sure it works for external Kafka. I will let you know the outcome either way.

    • @hsz7338
      @hsz7338 3 years ago

      @@AWSTutorialsOnline I couldn't make it work; it complains about the SSL handshake.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      @@hsz7338 Hmm. The Kafka-type connection does ask for certificate configuration; it could be that the external Kafka is demanding the same. Is there no way you can provide a specific certificate in the code? I am thinking of using a network connection just to set up network access and then kafka-python for the Python-based access, like here - kafka-python.readthedocs.io/en/master/apidoc/BrokerConnection.html

    • @hsz7338
      @hsz7338 3 years ago

      @@AWSTutorialsOnline I have tried to use the Kafka connection previously. Confluent Cloud Kafka doesn't provide a private CA (Confluent Platform does). The authentication protocol is SASL_SSL, which is why I had to use the Network connection type in Glue ETL. The kafka-python package works fine against a local Apache Kafka, but Confluent suggests using confluent-kafka-python in the Python client application, and I can't seem to make that work in a Glue ETL job.
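
      For reference, kafka-python's SASL_SSL settings look roughly like this (the bootstrap server, API key/secret, and topic are placeholders); as noted above, Confluent Cloud may still push you towards the confluent-kafka-python client instead:

      from kafka import KafkaProducer

      producer = KafkaProducer(
          bootstrap_servers="pkc-xxxxx.us-east-1.aws.confluent.cloud:9092",  # placeholder broker
          security_protocol="SASL_SSL",
          sasl_mechanism="PLAIN",
          sasl_plain_username="my-api-key",       # placeholder Confluent API key
          sasl_plain_password="my-api-secret",    # placeholder Confluent API secret
      )
      producer.send("my-topic", b"hello from Glue")
      producer.flush()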

  • @pragtyagi6262
    @pragtyagi6262 3 years ago +1

    I am getting an error at the last step when running the Glue job, although I have performed all steps in the us-west-2 region. I am not even able to see the logs, as it shows the log group is not available in the mentioned region, so I cannot see the error logs either.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      The workshop uses the eu-west-3 region. Have you made sure that all the resources you created are in the us-west-2 region? Also, in the job code, the eu-west-3 region is hard-coded as a parameter. Have you updated that as well? Please let me know.

    • @pragtyagi6262
      @pragtyagi6262 3 years ago

      @@AWSTutorialsOnline Yes, all my resources are in the us-west-2 (Oregon) region. Also, in my code I have replaced the S3 bucket name with my S3 bucket and the region with us-west-2. It is still failing. I checked and found that AWS Glue is supported in us-west-2, so I am not sure what the issue is. Any suggestions on how to enable CloudWatch logs in this POC?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      CloudWatch logging is enabled by default. Do you have permission to see the logs? What permissions does your logged-in account have?

    • @pragtyagi6262
      @pragtyagi6262 3 years ago

      @@AWSTutorialsOnline I assigned the AWS administrator policy to the role I created for this task, along with the S3 full access policy.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      Thanks for the details. That role is for the Glue job, which means the Glue job has administrative permissions. What permissions does your AWS logged-in account have? Can it see logs in general?

  • @sodiqafolayan4921
    @sodiqafolayan4921 3 years ago

    Hi, thank you for this cool workshop. However, I recreated this on my own using a custom VPC but the job never ran successfully. Below are the steps I followed, and I will be glad if you can point out what else I need to do:
    1. I created an EIP (to be used by the NAT)
    2. I created a VPC
    3. I created a public and a private subnet
    4. I created an IGW and attached it to the VPC I created in step 2
    5. I created a route table, added the IGW as a route, and associated it with the public subnet
    6. I created a NAT Gateway in the public subnet and associated the EIP created in step 1 with it
    7. I created another route table, opened a 0.0.0.0/0 route with the NAT as the target, then associated the route table with the private subnet
    8. I created an IAM role for Glue to access S3
    9. I created an S3 bucket
    10. I created a dummy JDBC connection and put it inside the VPC and private subnet that I created above
    11. I created the Glue job accordingly, but after running the job, it failed. Unfortunately, it did not give me any log, and I can't understand the reason it failed.
    Note that I edited the Python script, used the S3 bucket I created, and changed the region to the region I was working in, but it still did not work.
    Obviously there is something I am not doing right, but I couldn't figure it out.
    I will appreciate your feedback.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      Hi - your configuration seems right. It seems your code is not able to reach the S3 bucket. Can you please check two things:
      1) Is your job able to make an outbound call to the internet? Try the API provided by me and see if it works. This will ensure that your job and VPC configuration are working.
      2) If 1 works, then focus on the code that accesses the S3 bucket; it might have an error. Please send me the code - I can check as well.
      Hope it helps you move forward.

    • @sodiqafolayan4921
      @sodiqafolayan4921 3 years ago

      @@AWSTutorialsOnline I used exactly the code you provided, with the API you provided. I only changed the bucket name and region as expected. I also noticed that anytime I run the job, the script gets stored in the bucket, but I do not get the returned text as expected.
      My question: how do I specifically check if my job is able to make an outbound call? What I have done in this regard is to create a 0.0.0.0/0 route in the private subnet's route table with the NAT as the target, and the NAT has access to the internet, so I think this should give it internet access. I will be glad if there is any other way I can check whether it can make internet calls from the private subnet.
      Once again, thank you so much for the awesome job. I have already bookmarked about 5 of your workshops and am turning them into mini projects. You have really helped me to learn.
      NB: The code I used
      import requests
      import boto3

      URL = "https://jsonplaceholder.typicode.com/todos/1"
      r = requests.get(url=URL)

      s3_client = boto3.client('s3', region_name='i-inserted-my-region-here')
      s3_client.put_object(Body=r.text, Bucket='i-have-my-bucket-name-here', Key='mydata.txt')

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      @@sodiqafolayan4921 Your network configuration looks good. One way to test is to launch an EC2 instance in the same private subnet and access the URL from there. Once URL access is confirmed, a few more things to check: 1) the Glue job and the S3 bucket are in the same region; 2) the Glue job role has permission to write to the S3 bucket.
      Please let me know if it helps.

    • @sodiqafolayan4921
      @sodiqafolayan4921 3 years ago

      @@AWSTutorialsOnline Thank you so much. I was able to complete the workshop successfully. I am now replicating it with CFN

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago

      @@sodiqafolayan4921 Great. What fixed the problem?