You’re amazing as always Brajendra 😃
Thank you so much 😀 Hope you are doing great
Great video, thank you!
Thank you so much for the tutorial. I agree that using "external" libraries is good practice for managing and maintaining a codebase at scale. A quick question on the parallel file writing to S3 from the Glue ETL job (time 22.46): is it possible to configure the file size or file count in the Glue job to avoid ending up with a massive number of small objects in the S3 data lake?
Hi, yes it is possible. Please check this link - survey.fieldsense.whs.amazon.dev/survey/3553ba63-b201-47fd-8f3e-46bfcc648192
@AWSTutorialsOnline Thank you. I’ll check it out.
Could you please share a link or any reference on how you created the Python zip file from the two Python programs you created? Thanks for your help.
Hope this link helps. docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html
Can you show an example of how to achieve the same in a Glue Python Shell job?
My library contains only a myscript.py file. I uploaded myscript.py to an S3 bucket, and when creating the Dev endpoint I referenced S3bucket/Prefix/myscript.py in the "Python library path" option. But in my notebook, "import myscript" still yields the error "ImportError: No module named myscript". I also tried placing myscript.py in a folder called "customerlibs" and zipping this folder into customerlibs.zip, but that didn't work either. Do you have any recommendations? Thanks
I think the .py file should be in the root of the zip file, and then you can use "import myscript". If the .py file sits inside the customerlibs folder, then it should be "import customerlibs.myscript". Here is the documentation from AWS - docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html#aws-glue-programming-python-libraries-dev-endpoint
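To illustrate the two layouts described in this reply, here is a minimal local sketch using plain Python rather than Glue itself. It builds one zip with myscript.py at the root and one with it inside a customerlibs folder, then imports from both. The file contents, the zip file names, and the `__init__.py` marker in the folder layout are my assumptions for the demo, not something taken from the video.

```python
import sys
import tempfile
import zipfile
from pathlib import Path


def build_zip(zip_path, files):
    """Write {archive_name: source_text} entries into a new zip file."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        for arcname, source in files.items():
            zf.writestr(arcname, source)


workdir = Path(tempfile.mkdtemp())

# Layout A: module at the zip root -> "import myscript"
root_zip = workdir / "root_layout.zip"
build_zip(root_zip, {
    "myscript.py": "def hello():\n    return 'hello from root'\n",
})

# Layout B: module inside a folder -> "import customerlibs.myscript"
# (the __init__.py file is an assumption that makes the folder a regular package)
pkg_zip = workdir / "package_layout.zip"
build_zip(pkg_zip, {
    "customerlibs/__init__.py": "",
    "customerlibs/myscript.py": "def hello():\n    return 'hello from package'\n",
})

# Python can import straight from zip files that are on sys.path
sys.path.insert(0, str(root_zip))
sys.path.insert(0, str(pkg_zip))

import myscript
import customerlibs.myscript

root_result = myscript.hello()
pkg_result = customerlibs.myscript.hello()
print(root_result)
print(pkg_result)
```

If the import works locally against the zip like this, the same zip should be a reasonable candidate to upload and reference from the "Python library path" option.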
I tried to follow this tutorial and placed a zip file in an S3 bucket, but it always gives a "ModuleNotFoundError: No module named" error...
I see this error when the module is not packaged properly. Your module files should be in the root of the zip package. Hope it helps.
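One way to check the packaging before uploading is to list what the zip exposes at its root. The helper below is a hypothetical sketch (the function name and the "mylibs" folder are made up for illustration). It reports which names a plain "import name" could resolve from the archive root.

```python
import io
import zipfile


def top_level_imports(zip_file):
    """Return the names that `import <name>` could resolve at the zip root."""
    names = set()
    with zipfile.ZipFile(zip_file) as zf:
        for entry in zf.namelist():
            parts = entry.split("/")
            if len(parts) == 1 and entry.endswith(".py"):
                names.add(entry[:-3])    # module file sitting at the root
            elif len(parts) == 2 and parts[1] == "__init__.py":
                names.add(parts[0])      # package folder sitting at the root
    return names


# Zip where the module was accidentally nested inside a plain folder
nested_buf = io.BytesIO()
with zipfile.ZipFile(nested_buf, "w") as zf:
    zf.writestr("mylibs/myscript.py", "x = 1\n")
nested = top_level_imports(nested_buf)

# Zip where the module sits at the root
good_buf = io.BytesIO()
with zipfile.ZipFile(good_buf, "w") as zf:
    zf.writestr("myscript.py", "x = 1\n")
good = top_level_imports(good_buf)

print(nested)
print(good)
```

An empty result for your own zip would suggest the module files ended up one folder too deep.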
This all with infrastructure as code would be amazing. This in Terraform, with a pipeline for deploying the zip
You can use Terraform, CloudFormation, or CDK, whichever you prefer, for infrastructure as code. Then use AWS developer tools to build the pipeline.
Hi! Great tutorial! I am working in the Spark shell. Are the packages in the .zip file still available for use there, or does Glue provide some basic packages out of the box? Thanks in advance for the help.
.zip is good enough.
Can you please give a demo on how to connect to a Hadoop/Hive database using AWS Glue?
Glue Catalog is Hive-based. Please check my video where I talked about using PySpark to talk to Glue Catalog. Hope that helps.
@AWSTutorialsOnline Which video? Please specify the name, thanks.
Getting the error: Error downloading from S3 for bucket. Access Denied
Hi, how can I use external libraries such as jira in an AWS Glue job?
Use the command "pip install jira-module-name -t /path" to create a local package for the jira module. Zip the local package and upload it to the S3 bucket. Finally, reference the zip file's S3 location as an external library in the Glue job. The Glue job role must have access to the S3 bucket where the module package is uploaded. Hope it helps.
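The zip step can be sketched in Python as follows. This is only an illustration under assumptions: the "fakelib" folder stands in for whatever "pip install ... -t" would write into the target directory, and the names "package" and "external_libs" are hypothetical. The key point is that the contents of the target folder are zipped, not the folder itself, so the installed packages land at the archive root.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_package_dir(target_dir, zip_base):
    """Zip the *contents* of target_dir so packages sit at the archive root."""
    return shutil.make_archive(str(zip_base), "zip", root_dir=str(target_dir))


# Stand-in for the output of: pip install <module-name> -t ./package
workdir = Path(tempfile.mkdtemp())
target = workdir / "package"
(target / "fakelib").mkdir(parents=True)
(target / "fakelib" / "__init__.py").write_text("VERSION = '0.1'\n")

zip_path = zip_package_dir(target, workdir / "external_libs")

# Inspect the archive: entries should start with the library folder,
# not with the "package" target directory.
with zipfile.ZipFile(zip_path) as zf:
    names = zf.namelist()
print(names)

# Next step (not run here): upload zip_path to the S3 bucket and reference
# its S3 location as the external library of the Glue job.
```

After the upload, remember that the Glue job role still needs read access to that bucket, as the reply notes.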
@AWSTutorialsOnline What a useful reply! You really helped me. I was searching for a simple way to import a specific library, "redshift_connector", into an AWS Glue job, and your reply gave me the hint. I installed it locally, zipped all the dependencies not already available in Glue 3.0, and it worked.
Is it possible to call a Lambda function from this script?
What about getting this to run locally?
Well, you can do development and run locally (link below). But then you cannot use features like serverless run, scheduling and running with workflow. docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html
While importing, I am getting a "no module" error.
Do you get the error on the "import" statement?