AWS Tutorials - Working with Data Sources in AWS Glue Job

  • Published on Nov 4, 2024

Comments • 27

  • @shamstabrez2986
    @shamstabrez2986 2 years ago

    Now it is slowly getting tougher. No doubt it is a detailed, awesome video; it's just that Lake Formation keeps creating confusion for me again and again, about Lake Formation versus the data lake.

  • @hsz7338
    @hsz7338 3 years ago +1

    Thank you so much for the video. I think it is very useful. I have a few questions; apologies if you might have mentioned them in other videos. 1. On Glue crawler jobs, assume there is an ETL job that ingests a particular data source into the Data Lake (producing a data file every time). What is your recommendation on the frequency of the Glue crawler job: run it every time there is an ETL output file, or once a day (if the ingestion frequency is high) to keep the cost low?
    2. On Glue crawler connections such as the Redshift connection and the JDBC connection (in the tutorial), can a single connection be used by multiple Glue jobs simultaneously, i.e. does each Glue job create an instance or an object of the connection?
    3. In the video, at 38:54, a Glue job populates a table "employmentmini" in the RDS database, but I have not seen a primary key created on that table in Postgres in the Notebook code. Does this mean that Postgres doesn't enforce a primary key on a table created by a Glue job via a Glue connection?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  3 years ago +1

      Thanks for keeping the good questions coming. I hope the audience is making use of them. Here are my responses -
      1) It depends. If you think the schema is not going to change and you are confident about it, you can run the crawler only once. Otherwise, the frequency will depend on your confidence about schema conformity.
      2) One connection is for one JDBC / Redshift endpoint only, but it can be used across multiple jobs, even in parallel.
      3) Glue will not create the primary key. You can define the table in RDS beforehand with a primary key and let Glue only load data into it, as in the sketch below.
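
      A minimal sketch of point 3 (the catalog database, table, and Glue connection names are placeholders, not the exact ones from the video): create the table with its primary key in Postgres first, then let the Glue job only load data through the Glue connection.

      # Run this DDL once in Postgres (outside Glue) so the primary key exists:
      #   CREATE TABLE public.employmentmini (
      #       emp_id   INTEGER PRIMARY KEY,
      #       emp_name VARCHAR(100)
      #   );
      from awsglue.context import GlueContext
      from pyspark.context import SparkContext

      glueContext = GlueContext(SparkContext.getOrCreate())

      # Read the source data that the crawler registered in the Data Catalog
      # (database and table names are assumed).
      dyf = glueContext.create_dynamic_frame.from_catalog(
          database="employment_db", table_name="employment_src")

      # Load into the pre-created table through the Glue JDBC connection; because the
      # table already exists, Glue only inserts rows and the primary key is preserved.
      glueContext.write_dynamic_frame.from_jdbc_conf(
          frame=dyf,
          catalog_connection="rds-postgres-connection",  # assumed Glue connection name
          connection_options={"dbtable": "public.employmentmini", "database": "postgres"})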

  • @deepakbhardwaj9543
    @deepakbhardwaj9543 2 years ago +2

    Hi AWS Tutorials, can you please help me with the situation below?
    Expected output:
    - S3 to SQL Server (hosted on Windows Server)
    Question 1) If multiple files arrive in S3 daily, will the crawler create the same number of tables in the catalog?
    Question 2) To store all the daily data coming into S3, do we need as many ETL jobs as we have catalog tables?
    I know it's hard to reply to all the questions; I am just hoping you will reply to me! Thanks in advance ❤️

  • @deepakbhutekar5450
    @deepakbhutekar5450 1 year ago

    I have one question: if using the Data Catalog is the recommended approach, then how do you handle a daily load coming into the data source with a crawler? I am finding it difficult to handle the daily load using the Glue crawler.

  • @SheetalPandrekar
    @SheetalPandrekar 2 years ago +1

    The create_dataframe_from_catalog() method does not have a way to provide a SQL query when extracting data. For that we need to use Spark. I'm getting a communication link failure when I try to read MySQL with Spark.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      There is no SQL option when fetching, but there is a workaround: fetch the data as a DataFrame and then use a SQL transformation in Glue to filter it down the way you want, as in the sketch below.
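
      A minimal sketch of that workaround (the catalog database "sales_db" and table "orders" are placeholders): read the whole table through the catalog first, then apply the SQL.

      from awsglue.context import GlueContext
      from pyspark.context import SparkContext

      glueContext = GlueContext(SparkContext.getOrCreate())
      spark = glueContext.spark_session

      # No SQL can be pushed into the catalog read; it returns the full table.
      dyf = glueContext.create_dynamic_frame.from_catalog(
          database="sales_db", table_name="orders")

      # Convert to a Spark DataFrame and filter with SQL after the read.
      dyf.toDF().createOrReplaceTempView("orders")
      filtered = spark.sql("SELECT order_id, amount FROM orders WHERE amount > 100")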

    • @SheetalPandrekar
      @SheetalPandrekar 2 years ago

      @AWSTutorialsOnline When reading with Spark we can provide a SQL query.
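
      A minimal sketch of that Spark JDBC read (URL, credentials, and query are placeholders); a "communication link failure" with this approach usually means the job cannot reach the database over the network (Glue connection subnet / security group), not a problem with the query itself.

      # Assumes `spark` comes from the Glue job's session, as in the sketch above.
      df = (spark.read.format("jdbc")
            .option("url", "jdbc:mysql://mydb.xxxxxxxx.us-east-1.rds.amazonaws.com:3306/sales_db")
            .option("query", "SELECT order_id, amount FROM orders WHERE amount > 100")
            .option("user", "admin")
            .option("password", "********")
            .option("driver", "com.mysql.cj.jdbc.Driver")
            .load())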

  • @imransadiq5851
    @imransadiq5851 1 year ago

    Thank you for the amazing content. I want to ask: if my data source is an RDS Postgres database and I want to create a connection to this data source from another AWS account, how can I do this? My data source is in one AWS account and I am trying to connect to it from another AWS account, but it's not working. Recommendations will be highly appreciated.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  1 year ago

      You need to create a network path between the VPC of the Glue connection in one account and the VPC of the database in the other account. You might do this using VPC peering or Direct Connect. Once that is done, you should be able to connect.
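
      A minimal sketch of the VPC peering step (account IDs, VPC IDs, and region are placeholders), run from the account that owns the Glue connection's VPC:

      import boto3

      ec2 = boto3.client("ec2", region_name="us-east-1")

      # Request a peering connection from the Glue connection's VPC to the database VPC.
      ec2.create_vpc_peering_connection(
          VpcId="vpc-0aaa1111bbb2222cc",      # VPC used by the Glue connection
          PeerVpcId="vpc-0ddd3333eee4444ff",  # VPC of the RDS Postgres instance
          PeerOwnerId="222222222222")         # account that owns the database

      # The database account then accepts the request (ec2.accept_vpc_peering_connection),
      # and both sides need route table entries plus a security group rule for port 5432.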

  • @swapnilkulkarni6719
    @swapnilkulkarni6719 2 years ago +1

    Hi,
    if cross-account access is provided but the classification of the table is unknown (it is supposed to be Parquet), how do I handle this issue?
    Without a classification, the job throws the error "No classification for table".

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago +1

      Are you trying to crawl S3 data from a different account? Please make sure the bucket-level access allows that (a sketch follows below). Also, be sure about the format: Parquet is generally a very clean standard, so if the crawler is not able to identify the schema, the data might be something else.
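
      A minimal sketch of that bucket-level access (bucket name and account ID are placeholders), applied by the bucket owner so the crawling account can read the objects:

      import json
      import boto3

      policy = {
          "Version": "2012-10-17",
          "Statement": [{
              "Sid": "AllowCrossAccountCrawl",
              "Effect": "Allow",
              "Principal": {"AWS": "arn:aws:iam::111111111111:root"},  # crawling account
              "Action": ["s3:GetObject", "s3:ListBucket"],
              "Resource": ["arn:aws:s3:::my-data-bucket",
                           "arn:aws:s3:::my-data-bucket/*"]}]}

      boto3.client("s3").put_bucket_policy(
          Bucket="my-data-bucket", Policy=json.dumps(policy))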

    • @swapnilkulkarni6719
      @swapnilkulkarni6719 2 years ago

      @AWSTutorialsOnline Thanks. Found the issue; now able to access and use the data using dynamic frames.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      @swapnilkulkarni6719 Out of curiosity, what was the issue?

    • @swapnilkulkarni6719
      @swapnilkulkarni6719 2 years ago +1

      @AWSTutorialsOnline Data lake admin access was missing, due to which there was the "No classification" error.
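
      For reference, a minimal sketch of granting the missing data lake permissions in Lake Formation with boto3 (role ARN, database, and table names are placeholders; making the role a data lake administrator in the Lake Formation console is the other route):

      import boto3

      lf = boto3.client("lakeformation")

      # Grant the Glue crawler/job role access to the table registered in Lake Formation.
      lf.grant_permissions(
          Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111111111111:role/GlueJobRole"},
          Resource={"Table": {"DatabaseName": "shared_db", "Name": "orders"}},
          Permissions=["SELECT", "DESCRIBE"])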

  • @sankarsb
    @sankarsb 2 years ago

    Hi, I am trying to consume a Data Catalog from a different AWS account into the current account, write a transformation to join both catalogs on a common ID field, and store the resulting catalog in the current AWS account. Here is an example:
    AWSAccount1 has DataCatalog1 and AWSAccount2 (the current AWS account) has DataCatalog2.
    I want to write a transformation with a join as
    DataCatalog1.Table1.empid = DataCatalog2.Table2.empid
    and store this merged data catalog as DataCatalog3.Table3 in the current account.
    Basically, I want to merge the two data catalogs into a single bigger Data Catalog.
    AWSAccount1 only shares its Data Catalog; we do not know much about the data internals.
    Is it possible to do it this way? I hope we can. What are the activities I need to do to achieve this requirement? Your quick help with this is greatly appreciated. We can do this in Athena, but we want to perform this activity in Glue Studio.

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      Please use cross-account Data Catalog sharing. It will work; a sketch of the join is below.
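
      A minimal sketch of the join once the catalog in AWSAccount1 has been shared (the account ID, database, table, and bucket names are placeholders):

      from awsglue.context import GlueContext
      from awsglue.transforms import Join
      from pyspark.context import SparkContext

      glueContext = GlueContext(SparkContext.getOrCreate())

      # Table1 from the shared catalog in AWSAccount1, referenced by its catalog ID.
      dyf1 = glueContext.create_dynamic_frame.from_catalog(
          database="datacatalog1_db", table_name="table1", catalog_id="111111111111")

      # Table2 from the current account's own catalog (AWSAccount2).
      dyf2 = glueContext.create_dynamic_frame.from_catalog(
          database="datacatalog2_db", table_name="table2")

      # Join on the common empid field and write the result to S3; a crawler in the
      # current account can then register it as DataCatalog3.Table3.
      joined = Join.apply(dyf1, dyf2, "empid", "empid")
      glueContext.write_dynamic_frame.from_options(
          frame=joined,
          connection_type="s3",
          connection_options={"path": "s3://my-merged-data-bucket/table3/"},
          format="parquet")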

  • @quezobars
    @quezobars 2 years ago

    Hi, how about SharePoint as a source? Is it possible with an AWS Glue job? And is it a JDBC connection or an API?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      SharePoint is more like APIs. There are Python libraries available which can talk to SharePoint, and you can use them in a Glue job.

    • @quezobars
      @quezobars 2 years ago +1

      @AWSTutorialsOnline Yes, I saw in one of your videos you used the requests module for an API, but is this also possible with SharePoint, since it has its own authentication details? Thank you for your videos, by the way! Learning a lot!

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      @quezobars I recommend not using the requests module but rather some Python library for SharePoint (a sketch with one option follows). The rest remains the same.
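
      A minimal sketch with one such library, the third-party Office365-REST-Python-Client package (site URL, credentials, and file path are placeholders); it can be attached to the Glue job via the --additional-python-modules job parameter.

      from office365.runtime.auth.user_credential import UserCredential
      from office365.sharepoint.client_context import ClientContext

      # Authenticate against the SharePoint site (in a real job the credentials would
      # come from AWS Secrets Manager rather than being hard-coded).
      ctx = ClientContext("https://example.sharepoint.com/sites/mysite").with_credentials(
          UserCredential("user@example.com", "********"))

      # Download one file from a document library to the job's local disk for processing.
      with open("/tmp/report.csv", "wb") as local_file:
          ctx.web.get_file_by_server_relative_url(
              "/sites/mysite/Shared Documents/report.csv").download(local_file).execute_query()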

  • @shamstabrez2986
    @shamstabrez2986 2 years ago

    And please elaborate on what a DynamicFrame and a DataFrame are.

  • @veerachegu
    @veerachegu 2 years ago

    Can you please ping me?

    • @AWSTutorialsOnline
      @AWSTutorialsOnline  2 years ago

      Please tell me how I can help you.

    • @veerachegu
      @veerachegu 2 years ago

      @AWSTutorialsOnline Please help me.