How Do you Size Your Azure Databricks Clusters? Cluster Sizing Advice & Guidance in Azure Databricks

  • Published 20 Jan 2025

Comments • 10

  • @RaghavC20 · 3 years ago

    Thanks for making a short and useful video

  • @diogodallorto1 · 4 years ago

    Really good class! Congratulations and thank you!
    You could make a class about the Catalyst optimizer in Spark. Nobody explains it on YouTube!

  • @muritech · 3 years ago · +1

    Great video! In your opinion, is it best to have one High Concurrency cluster shared among a few analysts (heavy Pandas users), or one small machine per user? I'm worried that even with a High Concurrency setup, I might end up only sharing the driver capacity among the data analysts.

    • @AdvancingAnalytics · 3 years ago · +1

      With a cluster-per-user you end up paying way more as you're more likely to have under-utilised clusters and you're paying for a driver each time. Having one HC cluster means it has more power for any sudden spikes of heavy usage, can fit concurrent queries together to fully utilise the cluster, and only has the single driver. So from a cost perspective, definitely shared.
      One note on your users - make sure they're using Koalas over Pandas where possible to ensure they're getting the best scalability out of Spark!
      Simon
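
      The Koalas suggestion above can be sketched roughly as follows. Note that Koalas has since been folded into PySpark as `pyspark.pandas`; the tiny inline DataFrame is purely illustrative (in Databricks you would read from storage instead):

      ```python
      # Minimal sketch (assumption-laden, not from the video): using the
      # distributed pandas API on Spark (the successor to Koalas) so the same
      # dataframe-style code runs as Spark jobs on the executors, rather than
      # as single-node pandas on the driver.
      import pyspark.pandas as ps

      # Hypothetical toy data for illustration only.
      df = ps.DataFrame({"region": ["eu", "us", "eu"], "amount": [10, 20, 30]})

      # Same syntax an analyst would write in pandas, but executed by Spark.
      totals = df.groupby("region")["amount"].sum()
      ```

      Because the API mirrors pandas, heavy Pandas users on a shared High Concurrency cluster can keep most of their existing code while the work distributes across the cluster instead of piling onto the driver.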

  • @joyo2122 · 1 year ago

    Can you do a follow-up on this video? Many things have changed by now.

  • @ikernarbaiza2138 · 2 years ago

    How does the pricing of the clusters work? Or where could I find that information?

  • @NasimaKhatun-jb7qo · 2 years ago

    I see Databricks is good for large datasets, but what about processing a few KBs of data? How does it behave in such a scenario?

    • @AdvancingAnalytics · 2 years ago

      It'll work, but there's always a small overhead for parallelism. So you'll find it slower than a traditional database for working with very small data, just because of that! Otherwise, it works fine; we often have very small datasets being processed alongside some huge ones!

  • @Sangeethsasidharanak · 4 years ago

    6:13, size of driver: could you please explain how the largest dataset returned matters in determining the driver size? Because unless we call collect(), the executors will write to the destination, right?
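
    The distinction behind this question can be sketched as follows (an illustrative PySpark snippet, not the video's answer; the local session and row count are assumptions):

    ```python
    # Sketch: driver memory matters for results RETURNED to the driver.
    # A write (e.g. df.write.parquet(...)) streams from the executors to
    # storage, so the driver stays small; collect()/toPandas() pull rows
    # into driver memory, which is what "largest dataset returned" sizes for.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.range(1_000_000)

    # This materialises rows on the driver - keep such results small,
    # or size the driver for the largest result you expect to collect.
    small = df.limit(10).collect()
    ```

    So the commenter is right that a pure write path barely touches the driver; the driver-sizing guidance applies to workloads that collect results back.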

  • @Prashanth-yj6qx · 5 years ago

    I have an 800 GB dataset... how do I configure my cluster size?
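
    One common back-of-envelope approach to a question like this can be sketched in a few lines. Every number below (expansion factor, hot fraction, worker RAM, usable fraction) is an assumption chosen for illustration, not official Databricks guidance; the point is the shape of the calculation, which you would then refine against the Spark UI:

    ```python
    import math

    # Rough sizing sketch for an ~800 GB dataset (all figures assumed).
    dataset_gb = 800
    expansion = 3          # assumed in-memory blow-up after decompression
    hot_fraction = 0.25    # assume ~25% of partitions are "hot" at once
    working_set_gb = dataset_gb * expansion * hot_fraction

    worker_ram_gb = 112    # hypothetical memory-optimized 16-core worker VM
    usable_fraction = 0.6  # Spark reserves memory for execution/overhead
    workers = math.ceil(working_set_gb / (worker_ram_gb * usable_fraction))
    print(workers)         # a starting worker count to tune from
    ```

    A starting point like this is only that: run a representative job, check spill and shuffle in the Spark UI, and resize (or enable autoscaling) from there.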