How does query processing work in BigQuery?

  • Published on 15 Sep 2024

Comments • 22

  • @divertechnology 3 years ago +4

    I am fascinated by the voice; there must have been a cleaning process or something that makes it sound so clear.
    I love that she gives in-depth information on how it works. That's how I come to understand something well.

  •  3 years ago +4

    The Execution details tab helped a lot. I had to refactor some legacy queries with more than 600 stages into a cleaner and more optimized version. Love your product, guys.

  • @AvinashSingh-vj3rk 3 years ago +3

    Nice video 👍

  • @ankitlakum1 3 years ago +3

    Thanks 😊

  • @arjunk5959 3 years ago +2

    Nice info !!

  • @dheer211 2 years ago +1

    I remember seeing an internal architecture diagram for BQ that comprises shuffle, Dremel, and the networking fabric (sorry, I forgot its name). Can someone point me to that Google blog or video? Thanks.

  • @jaimemcarlosa 3 years ago +3

    I 💙 Google BigQuery

  • @jaeseokpark6241 3 years ago +3

    Wonderful!!!

  • @gabrieldjebbar7098 2 years ago +2

    Great !

  • @majorcemp3612 3 years ago +2

    So... in your example there is data skew because the average wait is lower than the max wait and the average read is lower than the max read? How would you improve that? (Adding slots? If yes, how many more?)

    • @leighajarett221 3 years ago +2

      I would suggest trying to filter the data to get a more uniform distribution!

    • @majorcemp3612 3 years ago +1

      @leighajarett221 There is already WHERE-style filtering, so what other filters would you use? Or what else would you do? Also, here the difference between avg and max is in the "wait" and "read" phases, so what could the problem be?

    • @leighajarett221 3 years ago +3

      @majorcemp3612 To get it more uniform, you can try looking at the data to understand the distribution and then filtering the data to get rid of the "tail end" of the curve. For example, if I have a range of values from 0 to 100 and most of my rows fall in 95-100, this might overwhelm the slots that are processing data with those keys. Instead, you could try filtering the data to focus on just the subset of that information you need (e.g. I only care about values of 95 or above). But that might not always be possible, depending on the question you are asking.
      Alternatively, you can split this up into two different queries, one where you analyze the values 0-95 and the other 95-100, so each has a more uniform distribution of that key. Hope that helps!

    • @shatakshiagrawal3062 1 year ago

      @leighajarett221 Great explanation!
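
A minimal sketch of the two ideas in this thread, written in plain Python rather than against BigQuery itself: comparing a stage's max time to its average as a rough skew signal, and splitting a skewed key range into two more uniform subsets. The function names `skew_ratio` and `split_at`, the boundary of 95, and the sample values are all hypothetical and are not part of any BigQuery API.

```python
def skew_ratio(avg_ms, max_ms):
    """Return max/avg for one phase of a stage (e.g. wait or read).

    A ratio near 1.0 suggests workers did similar amounts of work;
    a much larger ratio suggests a few workers did far more than the rest.
    """
    if avg_ms <= 0:
        raise ValueError("avg_ms must be positive")
    return max_ms / avg_ms


def split_at(values, boundary):
    """Partition values into those below the boundary and those at or above it."""
    low = [v for v in values if v < boundary]
    high = [v for v in values if v >= boundary]
    return low, high


# Hypothetical key values: most rows sit in the 95-100 range.
values = [3, 40, 72, 95, 96, 97, 98, 99, 100, 100]
low, high = split_at(values, 95)

# Hypothetical timings: the slowest worker took 5x the average.
ratio = skew_ratio(avg_ms=10, max_ms=50)
```

In SQL terms, the split would correspond to running two queries with complementary filters on the skewed key, which may or may not be practical depending on the question being asked.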

  • @cslearner582 1 year ago +1

    Great video. One question: are 'slot' and 'worker' the same in this context?

  • @RATANAGARWALITINFORMER 3 years ago +2

    GOOD

  • @RazvanCristianLung 3 years ago

    Streams? Why can't we delete newly added rows?

    • @leighajarett221 3 years ago +1

      These videos take a few months to produce but don't worry, it's on my list!

    • @RazvanCristianLung 3 years ago

      @leighajarett221 Thank you

    •  3 years ago

      Just from the complexity of the architecture and the number of caches involved, it is reasonable to say that deleting recent data is a really expensive operation and can slow down the query mechanism as a whole. There are a few workarounds that I found to bypass this issue. Try using updated_at timestamps to get the most up-to-date version of a certain record, or using materialized views with the filtered data.
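
The `updated_at` workaround mentioned above can be sketched as follows. This is a plain-Python illustration of the idea (append new versions instead of deleting, then read only the newest version per record), not BigQuery code; the function name `latest_versions`, the `id` and `status` fields, and the sample rows are assumptions made for the example.

```python
from datetime import datetime


def latest_versions(rows, key="id", ts="updated_at"):
    """Keep, for each key, only the row with the greatest timestamp."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[ts] > current[ts]:
            latest[row[key]] = row
    return list(latest.values())


# Hypothetical append-only rows: record 1 has two versions.
rows = [
    {"id": 1, "status": "new", "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "status": "new", "updated_at": datetime(2024, 1, 2)},
    {"id": 1, "status": "shipped", "updated_at": datetime(2024, 1, 3)},
]
current = latest_versions(rows)
status_by_id = {row["id"]: row["status"] for row in current}
```

The trade-off is that readers have to resolve versions at query time, which is why the comment also suggests a materialized view over the filtered data.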