Databricks Unity Catalog: Setup and Demo on AWS

  • Published Oct 2, 2024
  • Learn as we walk through, step by step, how to start your Lakehouse journey with Databricks Unity Catalog on Amazon Web Services (AWS). In this video we'll go through the entire process, from creating the S3 bucket and writing the IAM policy to creating the Unity Catalog metastore and demonstrating it in action with Databricks SQL! (A sketch of the trust policy shape is included after the links below.)
    Unity Catalog is a Databricks product that unifies data governance on the Databricks platform, enabling your organization to build strong access control over its data, analytics, and AI. Beyond access control lists, Unity Catalog provides a number of other useful features, such as data lineage to track where your data assets are used, both upstream and downstream.
    Link to Unity Catalog overview:
    docs.databrick...
    Documentation to get started with Unity Catalog (for the IAM policy snippets from the video, please see the following link):
    docs.databrick...
    If you prefer deploying your cloud infrastructure as code, check out the following guide on setting up everything you need for Unity Catalog using Terraform!
    registry.terra...
    Thanks again for watching!
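    For reference, the custom trust policy attached to the IAM role generally follows the shape sketched below. This is a minimal sketch with placeholder values (the Databricks-provided master role ARN, and your Databricks account ID as the external ID); take the exact, current values from the documentation linked above.

    import json
    import boto3  # assumes AWS credentials are already configured

    # Placeholders: substitute the real values from the Unity Catalog docs.
    DATABRICKS_UC_MASTER_ROLE_ARN = "arn:aws:iam::<databricks-account>:role/<uc-master-role>"
    DATABRICKS_ACCOUNT_ID = "<your-databricks-account-id>"  # used as sts:ExternalId

    # Cross-account trust policy letting the Databricks-managed role assume this role.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": DATABRICKS_UC_MASTER_ROLE_ARN},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"sts:ExternalId": DATABRICKS_ACCOUNT_ID}
                },
            }
        ],
    }

    iam = boto3.client("iam")
    iam.create_role(
        RoleName="unity-catalog-metastore-role",  # hypothetical role name
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )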

Comments • 15

  • @AthenaMao 4 months ago +1

    Where can I find the JSON template for the custom trust policy?

  • @lostfrequency89 2 months ago

    Is it possible to create volumes on top of this external storage container?
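    For what it's worth, Unity Catalog supports external volumes over an S3 path that is already governed by an external location. A minimal notebook sketch, with hypothetical catalog/schema/volume names and path:

    # Create an external volume over an S3 path covered by an existing
    # external location (hypothetical names and path).
    spark.sql("""
      CREATE EXTERNAL VOLUME IF NOT EXISTS dev.raw.landing_files
      LOCATION 's3://your-bucket/landing/'
    """)

    # Files in the volume are then addressable via the /Volumes path:
    files = dbutils.fs.ls("/Volumes/dev/raw/landing_files")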

  • @ft_angel91 8 months ago +2

    By far the best tutorial I've seen. Thank you for putting this out.

    • @kunnunhs1 4 months ago

      It's the worst; unclear.

  • @chaitanyamuvva 3 months ago

    Thanks for posting!! Much-needed stuff.

  • @rajanzb 7 months ago

    Wonderful demo. I have a question: where did you link the Unity Catalog metastore you created to the catalog in the Data Explorer? And how is the S3 bucket attached to the table created in the schema of the dev catalog? Please clarify.

    • @MakeWithData 7 months ago

      Thanks! Metastores are assigned to workspaces at the account level, and any catalogs you create in a workspace are automatically associated with that metastore (a workspace can have only one metastore assigned). When you create a metastore, you must configure a default S3 bucket for it, so your schemas/tables/etc. are stored in that bucket by default; however, you can also set up additional buckets as "External Locations" in UC and then use those as the default root storage location for specific catalogs or schemas you create. Hope this helps!
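      A rough sketch of that flow in Databricks SQL, run from a notebook. The names and paths are hypothetical, and a storage credential (here "uc_cred") is assumed to exist already:

      # Register an extra bucket as an external location (hypothetical names).
      spark.sql("""
        CREATE EXTERNAL LOCATION IF NOT EXISTS extra_bucket
        URL 's3://my-extra-bucket/'
        WITH (STORAGE CREDENTIAL uc_cred)
      """)

      # Use a path inside it as the default root storage for a new catalog,
      # instead of the metastore's default bucket:
      spark.sql("CREATE CATALOG IF NOT EXISTS dev MANAGED LOCATION 's3://my-extra-bucket/dev/'")
      spark.sql("CREATE SCHEMA IF NOT EXISTS dev.bronze")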

  • @NdKe-j3k 11 months ago

    Thank you for the video.
    I have a large (~15 GB) CSV file in S3. How can I process that data in Databricks? I don't want to mount the S3 bucket; is there any way to process this file in Databricks other than mounting it?

    • @MakeWithData 11 months ago

      Yes, there's no need to mount your bucket; you can read it from a PySpark or Scala notebook in Databricks with spark.read.csv("s3://path/to/data").
      15 GB for a single file is quite large, though. I would recommend splitting it into multiple smaller files if possible, so that you can get maximum parallelism from your Spark cluster. Ideally, you can even convert it to Delta Lake format. If you don't split it up or convert it, you may need a cluster with more memory available.
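      A sketch of that read-and-convert step, with hypothetical bucket path and table name:

      # Read the CSV straight from S3; no mount needed (hypothetical path).
      df = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv("s3://your-bucket/path/big_file.csv"))

      # Write it out as a Delta table; Spark splits the data into multiple
      # files automatically, so later reads parallelize well.
      df.write.format("delta").mode("overwrite").saveAsTable("dev.bronze.big_table")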

  • @hassanumair6967 1 year ago

    And what if we want to create a volume?
    I'm stuck on the Databricks configuration with AWS while using a trial of the Premium tier.
    Where I'm stuck is the default-metastore error, which occurs every time I try to create a volume.

    • @MakeWithData 1 year ago +2

      Hi, I recommend submitting a question to Stack Overflow using the [databricks] tag. Several others and I are very active in that forum and would be happy to help, given more details about your use case. Thank you for watching!

  • @aaronwong8533 1 year ago

    This is so helpful! Thank you for posting.

  • @hassanumair6967 1 year ago +1

    Another suggestion: it would be great if you made a tutorial-type video on this topic.
    It could cover backup and restoration in Databricks, such as what we save in our S3 buckets and what the parallel methods are, plus restoration policies, specifically for a geo-redundant setup with a large number of users.

  • @SaurabhKumar-ic7nt 1 year ago

    Awesome explanation!